[FEATURE]: Add contamination detection and config loader #60
DhruvK278 wants to merge 6 commits into AOSSIE-Org:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough

Adds a configurable contamination-detection subsystem: resolve benchmarks (CLI / YAML / defaults), load benchmark texts (Hugging Face or local), build/serialize an n-gram Bloom filter, perform two-stage contamination checks (Bloom + exact verify), redact contaminated XML chunks during extraction, and record contamination metadata in dataset manifests.
Sequence Diagram(s)

sequenceDiagram
participant CLI as CLI/Config
participant Config as load_config()
participant Fetch as fetch_benchmarks()
participant HF as HuggingFace Datasets
participant Local as Local Files
participant Bloom as build_bloom_filter()
participant Extract as extract_text_from_xml()
participant Check as check_contamination()
participant Manifest as generate_manifest()
CLI->>Config: load_config(cli_benchmarks)
Config->>Fetch: provide BenchmarkConfig
Fetch->>HF: optionally load HF datasets
Fetch->>Local: optionally load local files (JSONL/CSV)
Fetch-->>Bloom: return benchmark_texts
Bloom->>Bloom: build or load cached Bloom filter (serialize & metadata)
Bloom-->>Extract: provide bloom_filter, benchmarks_used
Extract->>Check: check_contamination(chunk, bloom_filter)
Check->>Check: n-gram generation + Bloom lookup
Check->>Check: exact substring verification against benchmark_texts
Check-->>Extract: contamination result
Extract->>Extract: redact contaminated chunks, count redactions
Extract-->>Manifest: pass contamination metadata (benchmarks_used, redacted_chunks)
Manifest->>Manifest: write dataset_manifest.json with contamination fields
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 153-158: The cached Bloom filter loading uses filter_path and
Bloom.load_bytes(raw, _bloom_hash) without validating the inputs that produced
it, so different runs can reuse a stale filter; update the cache handling to
validate metadata (benchmark sources, config.n, normalization/hash version)
before returning: when saving a Bloom (created in the code paths that call
Bloom.create or Bloom.add), persist a small metadata blob (inputs hash,
config.n, normalization/version) alongside the filter file, and when loading via
the block that uses filter_path and Bloom.load_bytes, read and compare that
metadata to current values and only return the loaded bloom if they
match—otherwise delete or ignore the cache and rebuild the Bloom to ensure
Bloom.load_bytes and _bloom_hash are only used for matching input state.
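The cache-validation idea in this comment can be sketched as follows; `cache_metadata`, `cache_is_valid`, and the metadata file layout are illustrative names for this sketch, not the project's actual API:

```python
import hashlib
import json
from pathlib import Path

def cache_metadata(benchmarks, n, norm_version="1"):
    """Fingerprint of everything that influences the filter's contents."""
    blob = json.dumps(
        {"benchmarks": sorted(benchmarks), "n": n, "norm_version": norm_version},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def cache_is_valid(meta_path: Path, benchmarks, n) -> bool:
    """Reuse a cached Bloom filter only when its recorded fingerprint matches."""
    if not meta_path.is_file():
        return False
    recorded = json.loads(meta_path.read_text()).get("fingerprint")
    return recorded == cache_metadata(benchmarks, n)
```

Saving the fingerprint next to the serialized Bloom bytes, and checking it before `Bloom.load_bytes`, makes a stale cache rebuild instead of silently misclassifying chunks.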
- Around line 65-69: The current except block around load_dataset silences
dataset/network/auth failures by returning texts; instead propagate the failure:
remove the "return texts" and either re-raise the caught exception (raise) or
wrap and raise a descriptive custom exception (e.g., DatasetLoadError) so the
run fails, or record the per-source failure into the manifest/state before
raising; update the block that references load_dataset, dataset_id,
logger.warning and texts in contamination.py to either propagate the error or
persist a failure entry so callers can handle it rather than getting an empty
corpus.
- Line 66: The call to load_dataset(dataset_id, trust_remote_code=True) uses
untrusted user input and enables remote code execution; remove the hardcoded
trust_remote_code=True and instead gate it behind an explicit opt-in flag (e.g.
a config/CLI boolean like allow_trust_remote_code or trust_remote_code)
defaulting to False, then pass trust_remote_code only when that flag is true;
update the code path that calls load_dataset (reference: load_dataset and
dataset_id in contamination.py) to consult the new flag and ensure users must
explicitly enable it before remote code is allowed.
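A minimal sketch of the opt-in gating being requested; the `allow_trust_remote_code` flag name and the injectable `loader` parameter are hypothetical additions for illustration:

```python
def load_benchmark_dataset(dataset_id, allow_trust_remote_code=False, loader=None):
    """Pass trust_remote_code only when the caller has explicitly opted in.

    `loader` defaults to datasets.load_dataset; it is injectable so the
    gating logic can be exercised without network access.
    """
    if loader is None:
        from datasets import load_dataset  # real dependency, imported lazily
        loader = load_dataset
    return loader(dataset_id, trust_remote_code=allow_trust_remote_code)
```

With the default `False`, a dataset ID taken from CLI/YAML input can no longer trigger remote code execution unless the user flips the flag themselves.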
In `@openverifiablellm/utils.py`:
- Around line 131-136: The manifest generation currently omits contamination
metadata when benchmarks_used is None, making redacted runs indistinguishable
from unchecked runs; update the calls to generate_manifest (and the
generate_manifest implementation if needed) so that when contamination checking
is enabled you always pass a structured contamination/config/result block (e.g.,
a contamination or contamination_metadata argument) even if benchmarks_used is
None, and ensure generate_manifest writes that block to the manifest; apply the
same change to the other call sites noted (the call around lines 169-171) so the
manifest always contains the contamination check config and outcome when
checking is performed.
- Around line 86-90: Current lazy-import logic silently disables contamination
checking when only one of bloom_filter or benchmark_texts is provided; update
the block that sets check_contamination to validate configuration and fail fast:
if exactly one of bloom_filter or benchmark_texts is None (i.e., (bloom_filter
is None) != (benchmark_texts is None)), raise a clear configuration error
(ValueError or a project-specific ConfigurationError) stating both must be
supplied to enable contamination checking, otherwise perform the import from
openverifiablellm.contamination and set check_contamination as before.
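The fail-fast validation could look like the following; `contamination_enabled` is an illustrative helper, not an existing function in the codebase:

```python
def contamination_enabled(bloom_filter, benchmark_texts):
    """Fail fast when exactly one of the two inputs is supplied."""
    if (bloom_filter is None) != (benchmark_texts is None):
        raise ValueError(
            "Contamination checking requires both bloom_filter and "
            "benchmark_texts; got only one of them."
        )
    return bloom_filter is not None  # True means checking is enabled
```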
In `@pyproject.toml`:
- Around line 14-19: Remove "datasets", "pyyaml", and "rbloom" from the
top-level dependencies array in pyproject.toml and add them under an optional
extra named "contamination" (e.g., project.optional-dependencies.contamination
or tool.poetry.extras.contamination depending on your pyproject format); keep
"defusedxml" in core deps. This aligns with the lazy-imports used around
check_contamination and the conditional use of bloom_filter and benchmark_texts
so users who never enable contamination won't install those packages. Ensure the
extra lists the same package names (add version pins if project uses them) and
run a quick packaging/installation test to confirm optional extra installs when
requested.
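One possible shape for the suggested split, assuming a PEP 621 `[project]` table; the extra name and pins are illustrative:

```toml
[project]
dependencies = [
    "defusedxml",
]

[project.optional-dependencies]
contamination = [
    "datasets",
    "pyyaml",
    "rbloom",
]
```

Users who want contamination checking would then install with `pip install openverifiablellm[contamination]`, while everyone else keeps the lighter core dependency set.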
In `@tests/test_contamination.py`:
- Around line 96-103: The variable bloom1 in test_serialisation_roundtrip is
unused and triggers an F841 lint error; remove the unused binding by replacing
the standalone build_bloom_filter call with either an assertion on its
side-effect (e.g. confirm the file was created or that subsequent
build_bloom_filter(read) returns expected state) or simply call
build_bloom_filter(texts, config) without assigning to bloom1. Update references
in the test to use bloom2 and the existing get_ngrams loop; ensure any necessary
side-effect checks reference the build_bloom_filter invocation (use file
existence or config-backed state) rather than an unused bloom1 variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: af5c1849-6dbc-409e-a594-04eedda3423e
📒 Files selected for processing (6)
openverifiablellm/config.py, openverifiablellm/contamination.py, openverifiablellm/utils.py, pyproject.toml, tests/test_contamination.py, tests/test_util.py
@DhruvK278 Pull from main, fix the conflicts and check again. We are using
Introduce contamination detection pipeline and configuration handling. Adds openverifiablellm/config.py to load benchmark lists (CLI, YAML, defaults) and BenchmarkConfig dataclass. Adds openverifiablellm/contamination.py implementing n-gram generation, benchmark loaders (HF and local JSONL/CSV), Bloom filter construction/serialization (rbloom) and a two-stage contamination check (bloom + exact verification). Update openverifiablellm/utils.py to: adjust compute_sha256/compute_merkle_root call signatures, integrate optional contamination checking into extract_text_from_xml, track redacted chunks, and extend generate_manifest to include contamination metadata and merkle fields. Update pyproject.toml to add runtime deps: datasets, pyyaml, rbloom. Add comprehensive tests (tests/test_contamination.py) for n-grams, bloom filter behavior, contamination detection, manifest fields and backward compatibility; adjust tests/test_util.py to match renamed compute_sha256 parameter. Overall goal: enable detection and recording of dataset contamination when preprocessing large corpora.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_util.py (1)
89-104: 🧹 Nitpick | 🔵 Trivial: Drop the duplicated hash assertions.
These lines exercise the exact same path twice and do not add coverage. Keeping a single call/assert per case will make the tests easier to read and less noisy when they fail.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_util.py` around lines 89 - 104, Remove the duplicated redundant calls/assertions in the tests that call utils.compute_sha256 twice for the same inputs: in the first test remove the repeated assignment to actual (duplicate compute_sha256 call) and in test_different_content_different_hash remove the second identical assert; keep a single compute_sha256 invocation and a single assert per case to avoid duplicate coverage and noisy failures (references: utils.compute_sha256, test_different_content_different_hash).
♻️ Duplicate comments (3)
openverifiablellm/contamination.py (2)
153-158: ⚠️ Potential issue | 🟠 Major: Invalidate the Bloom cache when the inputs change.
The shared filter_path is reused solely on file existence. Switching benchmark sources, config.n, or even normalization logic can therefore load a stale filter and silently misclassify chunks. Persist cache metadata alongside the Bloom bytes and rebuild when it no longer matches the current inputs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 153 - 158, The Bloom cache is being loaded solely by file existence (filter_path) which can return stale data when inputs change; update the logic around Bloom.load_bytes and _bloom_hash to persist and validate cache metadata (e.g., source id, config.n, normalization/version hash) alongside the raw bytes and, on load, compare the stored metadata against current inputs and normalization parameters and rebuild the Bloom (recompute via the code path that creates a new Bloom) if any mismatch is detected; ensure the metadata is written whenever creating/updating the cache so subsequent runs can validate before returning the cached Bloom.
64-69: ⚠️ Potential issue | 🔴 Critical: Do not execute arbitrary HF dataset code or fail open on load errors.
BenchmarkConfig.benchmarks is populated from CLI/YAML input, so load_dataset(dataset_id, trust_remote_code=True) lets untrusted dataset IDs run remote code. The broad except then converts auth/network/schema failures into [], which silently disables contamination checking for that source. Default trust_remote_code to False and re-raise after logging instead of returning an empty corpus.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 64 - 69, The code currently calls load_dataset(dataset_id, trust_remote_code=True) and swallows all exceptions returning an empty texts list; change the call to load_dataset(dataset_id, trust_remote_code=False) to avoid running untrusted remote code, and after catching Exception in the load_dataset block use logger.warning("Could not load HF dataset '%s': %s", dataset_id, exc) (or similar) but re-raise the exception instead of returning texts so callers know loading failed; keep the texts variable initialization and only return it in the normal success path (i.e., remove the early return in the except block) to ensure contamination checks do not silently skip on load errors.

pyproject.toml (1)
14-19: ⚠️ Potential issue | 🔴 Critical: Restore tokenizers to the runtime dependencies.
openverifiablellm/tokenizer/bpe_tokenizer.py directly imports ByteLevelBPETokenizer from the tokenizers package, and openverifiablellm/tokenizer/factory.py instantiates BPETokenizer when tokenizer_type == "bpe". Removing tokenizers from project.dependencies will cause an ImportError at runtime when the BPE tokenizer path is used.
📦 Minimal fix

 dependencies = [
     "defusedxml",
+    "tokenizers==0.15.2",
     "datasets",
     "pyyaml",
     "rbloom"
 ]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 14 - 19, The pyproject.toml removed the runtime dependency on tokenizers but openverifiablellm/tokenizer/bpe_tokenizer.py imports ByteLevelBPETokenizer and openverifiablellm/tokenizer/factory.py may instantiate BPETokenizer when tokenizer_type == "bpe", causing ImportError at runtime; restore "tokenizers" in the dependencies list in pyproject.toml (add "tokenizers" to the dependencies array) so ByteLevelBPETokenizer and BPETokenizer can be imported/instantiated at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 113-128: get_ngrams currently allows non-positive n which leads to
producing empty-string n-grams; modify get_ngrams to validate its n argument and
raise a ValueError for n <= 0. In the get_ngrams function add a guard at the
start (before normalization/tokenisation) that checks if n is an integer > 0 and
raises ValueError("n must be a positive integer") or similar to prevent
inserting empty n-grams into downstream functions like build_bloom_filter and
check_contamination.
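The guard being requested can be sketched as follows; the inline normalisation here (lowercase plus whitespace collapse) is a stand-in for the module's real `_normalise`:

```python
def get_ngrams(text: str, n: int = 13) -> list:
    """Sliding-window word n-grams, rejecting non-positive n up front."""
    if not isinstance(n, int) or n <= 0:
        raise ValueError("n must be a positive integer")
    # Stand-in normalisation: lowercase and collapse whitespace.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Raising before tokenisation means neither build_bloom_filter nor check_contamination can ever receive an empty-string n-gram.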
In `@openverifiablellm/utils.py`:
- Around line 40-46: In the Merkle leaf construction and empty-case return,
remove the duplicated calls to compute_sha256 so each chunk is hashed only once
and the empty case returns a single compute_sha256 result; specifically, in the
block that sets leaf_hex (currently calling compute_sha256 twice) keep only one
call to compute_sha256 when assigning leaf_hex and appending
bytes.fromhex(leaf_hex) to leaves, and in the empty-list branch ensure only one
return compute_sha256(data=b"") remains; references to look for: leaf_hex,
compute_sha256, leaves.
- Around line 295-299: The manifest currently stores only benchmarks_used and
redacted_chunks which is insufficient to reproduce the contamination snapshot;
update the code that builds manifest (where
manifest["contamination_checks_passed"] = benchmarks_used and
manifest["redacted_chunks"] = redacted_chunks) to persist per-source snapshot
metadata: for each entry in benchmarks_used record dataset identifier, dataset
config/split, exact version or commit/tag (for HF datasets use dataset
build/version or datestamp), local file path if used, and a content digest
(e.g., sha256) of the dataset file or snapshot; also attach redaction details
per-source (e.g., redacted_chunks keyed by source) so the manifest contains
source->(version, path, hash, redacted_chunks) enabling exact reproduction.
- Around line 162-168: The module entrypoint still calls extract_text_from_xml
with a write_manifest argument but the function signature removed it, causing a
TypeError; update extract_text_from_xml to accept write_manifest (e.g., add
write_manifest=False to the parameter list) or add **kwargs to the signature so
the CLI call no longer fails, and ensure any write_manifest behavior (if needed)
is handled or ignored inside extract_text_from_xml; reference the
extract_text_from_xml function to locate and change its signature accordingly.
---
Outside diff comments:
In `@tests/test_util.py`:
- Around line 89-104: Remove the duplicated redundant calls/assertions in the
tests that call utils.compute_sha256 twice for the same inputs: in the first
test remove the repeated assignment to actual (duplicate compute_sha256 call)
and in test_different_content_different_hash remove the second identical assert;
keep a single compute_sha256 invocation and a single assert per case to avoid
duplicate coverage and noisy failures (references: utils.compute_sha256,
test_different_content_different_hash).
---
Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 153-158: The Bloom cache is being loaded solely by file existence
(filter_path) which can return stale data when inputs change; update the logic
around Bloom.load_bytes and _bloom_hash to persist and validate cache metadata
(e.g., source id, config.n, normalization/version hash) alongside the raw bytes
and, on load, compare the stored metadata against current inputs and
normalization parameters and rebuild the Bloom (recompute via the code path that
creates a new Bloom) if any mismatch is detected; ensure the metadata is written
whenever creating/updating the cache so subsequent runs can validate before
returning the cached Bloom.
- Around line 64-69: The code currently calls load_dataset(dataset_id,
trust_remote_code=True) and swallows all exceptions returning an empty texts
list; change the call to load_dataset(dataset_id, trust_remote_code=False) to
avoid running untrusted remote code, and after catching Exception in the
load_dataset block use logger.warning("Could not load HF dataset '%s': %s",
dataset_id, exc) (or similar) but re-raise the exception instead of returning
texts so callers know loading failed; keep the texts variable initialization and
only return it in the normal success path (i.e., remove the early return in the
except block) to ensure contamination checks do not silently skip on load
errors.
In `@pyproject.toml`:
- Around line 14-19: The pyproject.toml removed the runtime dependency on
tokenizers but openverifiablellm/tokenizer/bpe_tokenizer.py imports
ByteLevelBPETokenizer and openverifiablellm/tokenizer/factory.py may instantiate
BPETokenizer when tokenizer_type == "bpe", causing ImportError at runtime;
restore "tokenizers" in the dependencies list in pyproject.toml (add
"tokenizers" to the dependencies array) so ByteLevelBPETokenizer and
BPETokenizer can be imported/instantiated at runtime.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0703ff93-109e-4739-8fed-c98c72ea29bc
📒 Files selected for processing (6)
openverifiablellm/config.py, openverifiablellm/contamination.py, openverifiablellm/utils.py, pyproject.toml, tests/test_contamination.py, tests/test_util.py
    # --- contamination metadata ---
    if benchmarks_used is not None:
        manifest["contamination_checks_passed"] = benchmarks_used
        manifest["redacted_chunks"] = redacted_chunks
Persist benchmark snapshot metadata, not just names.
["gsm8k", "cais/mmlu"] plus redacted_chunks is not enough to reproduce the contamination corpus later. HF datasets can move, and local files need at least a path/hash. Without per-source revision or content digest data, the manifest cannot prove which benchmark snapshot was actually checked.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openverifiablellm/utils.py` around lines 295 - 299, The manifest currently
stores only benchmarks_used and redacted_chunks which is insufficient to
reproduce the contamination snapshot; update the code that builds manifest
(where manifest["contamination_checks_passed"] = benchmarks_used and
manifest["redacted_chunks"] = redacted_chunks) to persist per-source snapshot
metadata: for each entry in benchmarks_used record dataset identifier, dataset
config/split, exact version or commit/tag (for HF datasets use dataset
build/version or datestamp), local file path if used, and a content digest
(e.g., sha256) of the dataset file or snapshot; also attach redaction details
per-source (e.g., redacted_chunks keyed by source) so the manifest contains
source->(version, path, hash, redacted_chunks) enabling exact reproduction.
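A per-source manifest entry along the lines the review describes might look like this; the field names and the helper itself are illustrative, not the project's API:

```python
import hashlib
from pathlib import Path

def benchmark_snapshot(source, path=None, revision=None):
    """One manifest entry per benchmark source; field names are illustrative."""
    entry = {"source": source, "revision": revision}
    if path is not None:
        # Local files get a content digest so the exact snapshot is provable.
        entry["path"] = str(path)
        entry["sha256"] = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return entry
```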
Not the final push. Adding the required changes
Actionable comments posted: 3
♻️ Duplicate comments (3)
openverifiablellm/contamination.py (1)
118-133: ⚠️ Potential issue | 🟠 Major: Reject non-positive n-gram sizes.
n <= 0 still generates empty-string n-grams here. Once build_bloom_filter() inserts "", check_contamination() can start flagging unrelated text because "" in any_string is always true. Guard this before normalisation/tokenisation.
Proposed fix

 def get_ngrams(text: str, n: int = 13) -> List[str]:
@@
-    normalised = _normalise(text)
+    if n <= 0:
+        raise ValueError("n must be a positive integer")
+
+    normalised = _normalise(text)
     tokens = normalised.split()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/contamination.py` around lines 118 - 133, get_ngrams currently accepts n <= 0 and will generate empty-string n-grams which then pollute bloom filters; add an explicit guard at the top of get_ngrams to reject non-positive n (e.g. raise ValueError when n <= 0) before any normalization/tokenisation, so build_bloom_filter and check_contamination never see an empty-string n-gram; reference the get_ngrams function and ensure callers (e.g. build_bloom_filter / check_contamination paths) will receive the error rather than inserting "" into the filter.

openverifiablellm/utils.py (2)
40-46: 🧹 Nitpick | 🔵 Trivial: Remove the repeated SHA-256 work in Merkle construction.
Lines 40/41 and 55/56 hash the same bytes twice, and Line 46 is unreachable after the first return. The result is unchanged, but large dumps now pay the hashing cost twice for every leaf and parent.
Proposed fix
 while chunk := f.read(chunk_size):
     # reuse compute_sha256
     leaf_hex = compute_sha256(data=chunk)
-    leaf_hex = compute_sha256(data=chunk)
     leaves.append(bytes.fromhex(leaf_hex))
@@
 if not leaves:
     return compute_sha256(data=b"")
-    return compute_sha256(data=b"")
@@
 combined = left + right
 parent_hex = compute_sha256(data=combined)
-parent_hex = compute_sha256(data=combined)
 next_level.append(bytes.fromhex(parent_hex))

Also applies to: 55-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 40 - 46, The Merkle construction is performing duplicate SHA-256 work and has an unreachable return: call compute_sha256 exactly once per input chunk (replace the two identical calls that produce leaf_hex with a single call and append bytes.fromhex(leaf_hex) once), and remove the duplicated return so the empty-tree case returns compute_sha256(data=b"") only once; apply the same deduping fix where parent hashes are computed (ensure each parent node is hashed a single time via compute_sha256 and no duplicated calls remain).
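For reference, a deduplicated Merkle construction can be sketched like this; it hashes in-memory chunks rather than a file stream, and the odd-node duplication rule is one common convention rather than necessarily what utils.py uses:

```python
import hashlib

def compute_merkle_root(chunks) -> str:
    """SHA-256 Merkle root with each chunk hashed exactly once per level."""
    if not chunks:
        return hashlib.sha256(b"").hexdigest()
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        if len(level) % 2:                # odd count: duplicate the last node
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```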
263-270: ⚠️ Potential issue | 🟠 Major: Persist benchmark snapshot metadata, not just names.
The manifest payload built here only carries benchmarks_used and redacted_chunks. That is still not enough to reproduce which benchmark snapshot was checked later, especially for mutable Hugging Face datasets or local files. Please include per-source revision/path/content digest in contamination_metadata.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 263 - 270, The current contamination_metadata only stores benchmark names; change it to persist per-benchmark snapshot metadata (not just names) so the check is reproducible: when building contamination_metadata in the check_contamination branch, replace benchmarks_used list of strings with a list of dicts containing unique snapshot fields (e.g., "name"/"id", "source" (hf/local), "path" or "config", "revision"/"commit" if available, and a content_digest/hash for the snapshot or file); compute content_digest for local files (e.g., sha256 of file bytes) and capture dataset revision/sha or dataset fingerprint for HuggingFace datasets, and ensure redacted_chunks remains included; update any callers that expect benchmarks_used to handle the new list-of-dicts structure (look for uses of contamination_metadata and the variable benchmarks_used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/config.py`:
- Around line 50-55: The parsing currently accepts an empty benchmarks list or
entries that are only whitespace/comma, leading to an empty effective filter;
update the parsing of the "benchmarks" value (the local variable benchmarks in
openverifiablellm/config.py) to strip and filter out empty strings (e.g.,
split/trim multi-value CLI input and remove items where not b.strip()) and
require that at least one non-empty benchmark remains; if none remain, either
raise a ValueError or fall back to the default benchmark list. Apply the same
validation/fallback logic to the other benchmarks-handling block referenced
around lines 71-76 so both code paths reject/handle empty selections
consistently.
In `@openverifiablellm/contamination.py`:
- Around line 39-45: The loop over config.benchmarks currently falls back to
_load_hf when a path has a local suffix but the file is missing; change the
logic in the for benchmark in config.benchmarks loop to treat a benchmark that
ends with .jsonl or .csv as a required local file: use Path(benchmark) and if
path.suffix in (".jsonl", ".csv") then if not path.is_file() raise a clear
FileNotFoundError (or ValueError) referencing the benchmark string instead of
calling _load_hf, otherwise call _load_local(path); only call _load_hf for
entries that do not have a local suffix.
- Around line 243-250: The current exact-verification step uses substring
matching ("if ngram in nb") which allows false positives across token
boundaries; change it to compare whole-token n-grams by tokenizing each
normalised benchmark (use the same tokenization used to build
`ngrams`/`_normalise`) and check whether any contiguous token window equals the
`ngram` (or alternatively pad both `nb` and `ngram` with explicit boundary
markers before doing `in`); update the loop in the verification code that
iterates over `ngrams`, `normalised_benchmarks`, and `nb` to perform token-level
comparison (e.g., split `nb` into tokens and compare `" ".join(tokens[i:i+len_ngram]) == ngram`) so matches respect token boundaries.
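The boundary-padding variant mentioned in that comment can be sketched in a few lines; it assumes both strings are already single-space normalised:

```python
def exact_token_match(ngram: str, normalised_benchmark: str) -> bool:
    """Whole-token containment via explicit space padding.

    Padding both sides with a space means every token boundary is marked,
    so "he said" cannot match inside "the said ...".
    """
    return f" {ngram} " in f" {normalised_benchmark} "
```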
---
Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 118-133: get_ngrams currently accepts n <= 0 and will generate
empty-string n-grams which then pollute bloom filters; add an explicit guard at
the top of get_ngrams to reject non-positive n (e.g. raise ValueError when n <=
0) before any normalization/tokenisation, so build_bloom_filter and
check_contamination never see an empty-string n-gram; reference the get_ngrams
function and ensure callers (e.g. build_bloom_filter / check_contamination
paths) will receive the error rather than inserting "" into the filter.
In `@openverifiablellm/utils.py`:
- Around line 40-46: The Merkle construction is performing duplicate SHA-256
work and has an unreachable return: call compute_sha256 exactly once per input
chunk (replace the two identical calls that produce leaf_hex with a single call
and append bytes.fromhex(leaf_hex) once), and remove the duplicated return so
the empty-tree case returns compute_sha256(data=b"") only once; apply the same
deduping fix where parent hashes are computed (ensure each parent node is hashed
a single time via compute_sha256 and no duplicated calls remain).
- Around line 263-270: The current contamination_metadata only stores benchmark
names; change it to persist per-benchmark snapshot metadata (not just names) so
the check is reproducible: when building contamination_metadata in the
check_contamination branch, replace benchmarks_used list of strings with a list
of dicts containing unique snapshot fields (e.g., "name"/"id", "source"
(hf/local), "path" or "config", "revision"/"commit" if available, and a
content_digest/hash for the snapshot or file); compute content_digest for local
files (e.g., sha256 of file bytes) and capture dataset revision/sha or dataset
fingerprint for HuggingFace datasets, and ensure redacted_chunks remains
included; update any callers that expect benchmarks_used to handle the new
list-of-dicts structure (look for uses of contamination_metadata and the
variable benchmarks_used).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e6d5f329-2beb-4897-a11c-e00dd2ddaefe
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
openverifiablellm/config.py, openverifiablellm/contamination.py, openverifiablellm/utils.py, pyproject.toml, tests/test_contamination.py
@Archit381 Please review
✅ Actions performed: Review triggered.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 251-259: The hot path inside check_contamination rebuilds
normalised_benchmarks on every Bloom hit (see normalised_benchmarks, _normalise,
benchmark_texts), causing repeated O(number of benchmarks) work; precompute the
set of normalized benchmark texts once (e.g., in the preprocessing entry point
that calls extract_text_from_xml) and change check_contamination to accept a
precomputed_normalised_benchmarks parameter instead of constructing it inline,
then use that set for the inner exact-match loop to avoid rebuilding it on each
invocation.
- Around line 164-167: The current inputs_hash is computed by concatenating raw
benchmark_texts bytes which can collide across different segmentations; update
the hashing in the function that computes inputs_hash (use the hasher variable
near the loop over benchmark_texts) to include an unambiguous delimiter or
length-prefix for each text (e.g., write each text length then the text or
serialize each entry with a structured encoding) before hasher.update, so the
resulting inputs_hash cannot be forged by different concatenations of the same
bytes.
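A length-prefixed digest along the lines that comment suggests:

```python
import hashlib

def inputs_hash(benchmark_texts) -> str:
    """Length-prefix each entry so different segmentations hash differently."""
    hasher = hashlib.sha256()
    for text in benchmark_texts:
        data = text.encode("utf-8")
        hasher.update(len(data).to_bytes(8, "big"))  # unambiguous delimiter
        hasher.update(data)
    return hasher.hexdigest()
```

Without the prefix, ["ab", "c"] and ["a", "bc"] would concatenate to the same byte stream and collide.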
- Around line 72-84: The current load in contamination.py uses
load_dataset(dataset_id, trust_remote_code=trust_remote_code) which returns all
splits and pulls non-evaluation data into texts; change this to load only
evaluation splits (e.g., pass split="test" or "validation" as appropriate) or
make the split selection configurable via a new parameter (e.g.,
evaluation_splits or split_map) so that GSM8K uses "test" and MMLU uses
"test"/"validation"; adjust the call site that constructs ds (still handling
exceptions and raising DatasetLoadError) and then iterate over only those chosen
splits to populate texts (respecting _TEXT_FIELDS and the existing texts append
logic).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 674750de-288e-4aa4-83eb-9d730e04de24
📒 Files selected for processing (3)
openverifiablellm/config.py, openverifiablellm/contamination.py, tests/test_contamination.py
fix lint issues and resolve conflicts
Done!!
@Archit381 Please review. I have fixed the conflict and lint issue |
closes #50
Description
This PR introduces a Data Contamination Detection feature to the OpenVerifiableLLM preprocessing pipeline.
The goal of this feature is to automatically detect and redact text chunks from Wikipedia dumps that overlap with evaluation benchmark datasets (e.g., GSM8K, MMLU). This prevents benchmark contamination when training LLMs on the generated data. The detection uses an efficient two-stage approach: a fast n-gram matching mechanism backed by a Bloom filter, followed by an exact string match check to eliminate false positives.
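The two-stage approach can be sketched like this; a plain set stands in for the Bloom filter (a Bloom filter only adds false positives, which the exact-match stage removes), and the inline normalisation is illustrative rather than the module's real one:

```python
def check_contamination(chunk, ngram_index, benchmark_texts, n=13):
    """Two-stage check: fast membership test, then exact verification."""
    tokens = chunk.lower().split()
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    normalised = [" ".join(t.lower().split()) for t in benchmark_texts]
    for ngram in ngrams:
        if ngram in ngram_index:                       # stage 1: probabilistic
            if any(ngram in nb for nb in normalised):  # stage 2: exact
                return True
    return False
```

Stage one is cheap and filters out almost all clean chunks; stage two only runs on hits, so the exact scan over benchmark texts stays off the hot path.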
Key Changes Made:
1. Configuration & Dependencies:
- Added datasets, pyyaml, and rbloom to pyproject.toml.
- Added openverifiablellm/config.py to handle benchmark list loading via a BenchmarkConfig dataclass. Supports loading from CLI arguments (--benchmarks), a local benchmarks.yaml file, or defaults to ["gsm8k", "cais/mmlu"].

2. Core Contamination Detection (openverifiablellm/contamination.py):
- fetch_benchmarks to load datasets either from Hugging Face or local JSONL/CSV files.
- get_ngrams to generate and normalize sliding-window n-grams (default 13-grams).
- build_bloom_filter to construct, populate, and serialize an efficiently sized rbloom.Bloom filter to disk.
- check_contamination to perform the two-stage contamination check (Bloom filter first, exact match on hit).

3. Pipeline Integration (openverifiablellm/utils.py):
- extract_text_from_xml now accepts optional contamination checking parameters. If triggered, contaminated text chunks are silently skipped and tracked.
- generate_manifest extended to include "contamination_checks_passed" and "redacted_chunks" metadata fields.
- compute_sha256 and compute_merkle_root calls updated throughout the codebase to correctly use the mandated keyword arguments (data= or file_path=).

4. Testing & Verification:
- Added tests/test_contamination.py containing 17 comprehensive unit tests covering n-gram generation, bloom filter serialization, positive/negative contamination detection, and pipeline backward compatibility.
- Adjusted tests/test_util.py to align with the compute_sha256 keyword argument fixes.

Screenshots/Recordings:

Additional Notes:
- Uses rbloom for fast Rust-backed Bloom filter construction. Implemented a custom SHA-256-based hash function _bloom_hash to make the Bloom filter serializable via save_bytes() and load_bytes().
- The new parameters on extract_text_from_xml and generate_manifest are optional. Calling python -m openverifiablellm.utils <input_dump> will continue to work normally without contamination checking.

Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.