
[FEATURE]: Add contamination detection and config loader#60

Open
DhruvK278 wants to merge 6 commits into AOSSIE-Org:main from DhruvK278:feat/data-contamination-detection

Conversation


@DhruvK278 DhruvK278 commented Mar 10, 2026

closes #50

Description

This PR introduces a Data Contamination Detection feature to the OpenVerifiableLLM preprocessing pipeline.

The goal of this feature is to automatically detect and redact text chunks from Wikipedia dumps that overlap with evaluation benchmark datasets (e.g., GSM8K, MMLU). This prevents benchmark contamination when training LLMs on the generated data. The detection uses an efficient two-stage approach: a fast n-gram matching mechanism backed by a Bloom filter, followed by an exact string-match check to eliminate false positives.
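A minimal sketch of that two-stage flow. This is illustrative, not the PR's code: the function names echo the description, but a plain `set` of hashes stands in for the real `rbloom` Bloom filter (a Bloom filter would use far less memory at the cost of occasional false positives, which is exactly why stage 2 exists).

```python
import hashlib

def _hash_ngram(ngram: str) -> int:
    # Deterministic stand-in for the Bloom filter's hash function.
    return int.from_bytes(hashlib.sha256(ngram.encode()).digest()[:8], "big")

def build_filter(texts, n=13):
    # Stand-in "Bloom filter": a set of n-gram hashes built from benchmark texts.
    filt = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            filt.add(_hash_ngram(" ".join(tokens[i:i + n])))
    return filt

def check_contamination(chunk, filt, benchmark_texts, n=13):
    # Stage 1: probe each n-gram against the filter (fast; a real Bloom
    # filter may false-positive).  Stage 2: confirm hits with an exact search.
    tokens = chunk.lower().split()
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        if _hash_ngram(ngram) in filt:
            if any(ngram in t.lower() for t in benchmark_texts):
                return True
    return False
```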

Key Changes Made:

1. Configuration & Dependencies:

  • Added datasets, pyyaml, and rbloom to pyproject.toml.
  • Created openverifiablellm/config.py to handle benchmark list loading via a BenchmarkConfig dataclass. Supports loading from CLI arguments (--benchmarks), a local benchmarks.yaml file, or defaults to ["gsm8k", "cais/mmlu"].
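A rough sketch of that resolution order (CLI > local YAML > defaults). The names mirror the description above, but the exact signatures and YAML schema are assumptions:

```python
from dataclasses import dataclass, field
from pathlib import Path

DEFAULT_BENCHMARKS = ["gsm8k", "cais/mmlu"]

@dataclass
class BenchmarkConfig:
    benchmarks: list = field(default_factory=lambda: list(DEFAULT_BENCHMARKS))

def load_config(cli_benchmarks=None, yaml_path="benchmarks.yaml"):
    # Precedence: explicit --benchmarks CLI list > local benchmarks.yaml > defaults.
    if cli_benchmarks:
        return BenchmarkConfig(benchmarks=list(cli_benchmarks))
    path = Path(yaml_path)
    if path.exists():
        import yaml  # pyyaml; only needed when a YAML file is actually present
        data = yaml.safe_load(path.read_text()) or {}
        if data.get("benchmarks"):
            return BenchmarkConfig(benchmarks=list(data["benchmarks"]))
    return BenchmarkConfig()
```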

2. Core Contamination Detection (openverifiablellm/contamination.py):

  • Added fetch_benchmarks to load datasets either from Hugging Face or local JSONL/CSV files.
  • Implemented get_ngrams to generate and normalize sliding-window n-grams (default 13-grams).
  • Implemented build_bloom_filter to construct, populate, and serialize an efficiently sized rbloom.Bloom filter to disk.
  • Added check_contamination to perform the two-stage contamination check (Bloom filter first, exact match on hit).
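The n-gram step above could look roughly like this (a sketch; the whitespace-collapsing normalisation shown here is an assumption, and the positive-`n` guard anticipates a review note below):

```python
import re

def get_ngrams(text: str, n: int = 13) -> list[str]:
    # Sliding-window word n-grams over lightly normalised text.
    if n <= 0:
        raise ValueError("n must be a positive integer")
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    tokens = normalised.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```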

3. Pipeline Integration (openverifiablellm/utils.py):

  • Updated extract_text_from_xml to accept optional contamination-checking parameters. When enabled, contaminated text chunks are silently skipped and tracked.
  • Extended generate_manifest to include "contamination_checks_passed" and "redacted_chunks" metadata fields.
  • Bug Fix: Adjusted compute_sha256 and compute_merkle_root calls throughout the codebase to correctly use the mandated keyword arguments (data= or file_path=).

4. Testing & Verification:

  • Added tests/test_contamination.py containing 17 comprehensive unit tests covering n-gram generation, bloom filter serialization, positive/negative contamination detection, and pipeline backward compatibility.
  • Updated tests/test_util.py to align with the compute_sha256 keyword argument fixes.

Screenshots/Recordings:

All tests have passed.

Additional Notes:

  • Bloom filter library selection: Used rbloom for fast Rust-backed Bloom filter construction. Implemented a custom SHA-256-based hash function _bloom_hash to make the Bloom filter serializable via save_bytes() and load_bytes().
  • Backward compatibility: All newly introduced arguments to extract_text_from_xml and generate_manifest are optional. Calling python -m openverifiablellm.utils <input_dump> will continue to work normally without contamination checking.

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Added contamination detection tooling with two-stage checks and automatic redaction; new configuration loader honoring CLI, YAML, or defaults.
  • Improvements

    • Manifests include contamination metadata (benchmarks used, redacted chunk counts).
    • On-the-fly contamination checks during text extraction; checksum utility accepts raw data or file input.
  • Tests

    • Comprehensive tests covering n-grams, Bloom filter behavior, contamination checks, manifest metadata, and related edge cases.
  • Dependencies

    • Added sentencepiece and an optional contamination extras group (datasets, pyyaml, rbloom).

@coderabbitai

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a configurable contamination-detection subsystem: resolve benchmarks (CLI / YAML / defaults), load benchmark texts (Hugging Face or local), build/serialize an n‑gram Bloom filter, perform two‑stage contamination checks (Bloom + exact verify), redact contaminated XML chunks during extraction, and record contamination metadata in dataset manifests.

Changes

Cohort / File(s) Summary
Benchmark Configuration
openverifiablellm/config.py
New BenchmarkConfig, DEFAULT_BENCHMARKS, BENCHMARKS_YAML, _load_from_yaml() and load_config() to resolve CLI/YAML/default benchmarks and Bloom parameters.
Contamination Detection Core
openverifiablellm/contamination.py
New module providing benchmark loaders (HF/local), n‑gram normalization (get_ngrams), deterministic Bloom hashing, build_bloom_filter() with metadata caching, two‑stage check_contamination() (Bloom + exact verify), and DatasetLoadError.
Pipeline Integration & Utilities
openverifiablellm/utils.py
extract_text_from_xml() accepts contamination args and lazy‑imports checker, redacts contaminated chunks and counts them; generate_manifest() accepts contamination metadata; compute_sha256() signature changed to keyword‑only data=/file_path=.
Dependencies / Packaging
pyproject.toml
Adds sentencepiece; new optional groups dev = ["pytest"] and contamination = ["datasets", "pyyaml", "rbloom"].
Tests
tests/test_contamination.py, tests/test_util.py
Adds comprehensive contamination tests (n‑grams, Bloom lifecycle, contamination flow, manifest metadata) and small util-test adjustments for updated signatures.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/Config
    participant Config as load_config()
    participant Fetch as fetch_benchmarks()
    participant HF as HuggingFace Datasets
    participant Local as Local Files
    participant Bloom as build_bloom_filter()
    participant Extract as extract_text_from_xml()
    participant Check as check_contamination()
    participant Manifest as generate_manifest()

    CLI->>Config: load_config(cli_benchmarks)
    Config->>Fetch: provide BenchmarkConfig
    Fetch->>HF: optionally load HF datasets
    Fetch->>Local: optionally load local files (JSONL/CSV)
    Fetch-->>Bloom: return benchmark_texts
    Bloom->>Bloom: build or load cached Bloom filter (serialize & metadata)
    Bloom-->>Extract: provide bloom_filter, benchmarks_used
    Extract->>Check: check_contamination(chunk, bloom_filter)
    Check->>Check: n-gram generation + Bloom lookup
    Check->>Check: exact substring verification against benchmark_texts
    Check-->>Extract: contamination result
    Extract->>Extract: redact contaminated chunks, count redactions
    Extract-->>Manifest: pass contamination metadata (benchmarks_used, redacted_chunks)
    Manifest->>Manifest: write dataset_manifest.json with contamination fields

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Python Lang

Suggested reviewers

  • Archit381

Poem

🐰 I nibble n‑grams in moonlit code,
I plant a Bloom to guard the road.
When crumbs appear I softly hide,
Redacted chunks hop off to tide.
Clean manifests — a bunny's pride.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Out of Scope Changes check — ⚠️ Warning: The PR includes unnecessary code changes: duplicate hash computations in compute_merkle_root and compute_sha256, and redundant test assertions in test_util.py that do not support the contamination detection feature. Resolution: remove the duplicate computations and redundant assertions to keep the changeset focused on the contamination detection implementation.
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The title accurately and concisely summarizes the main changes: adding contamination detection and a config loader, the primary features introduced.
  • Linked Issues check — ✅ Passed: The PR implements all core requirements from issue #50: configurable benchmark selection via CLI and YAML, Bloom filter construction and serialization, two-stage contamination detection, and manifest metadata recording.


@github-actions github-actions bot added size/XL and removed size/XL labels Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 153-158: The cached Bloom filter loading uses filter_path and
Bloom.load_bytes(raw, _bloom_hash) without validating the inputs that produced
it, so different runs can reuse a stale filter; update the cache handling to
validate metadata (benchmark sources, config.n, normalization/hash version)
before returning: when saving a Bloom (created in the code paths that call
Bloom.create or Bloom.add), persist a small metadata blob (inputs hash,
config.n, normalization/version) alongside the filter file, and when loading via
the block that uses filter_path and Bloom.load_bytes, read and compare that
metadata to current values and only return the loaded bloom if they
match—otherwise delete or ignore the cache and rebuild the Bloom to ensure
Bloom.load_bytes and _bloom_hash are only used for matching input state.
- Around line 65-69: The current except block around load_dataset silences
dataset/network/auth failures by returning texts; instead propagate the failure:
remove the "return texts" and either re-raise the caught exception (raise) or
wrap and raise a descriptive custom exception (e.g., DatasetLoadError) so the
run fails, or record the per-source failure into the manifest/state before
raising; update the block that references load_dataset, dataset_id,
logger.warning and texts in contamination.py to either propagate the error or
persist a failure entry so callers can handle it rather than getting an empty
corpus.
- Line 66: The call to load_dataset(dataset_id, trust_remote_code=True) uses
untrusted user input and enables remote code execution; remove the hardcoded
trust_remote_code=True and instead gate it behind an explicit opt-in flag (e.g.
a config/CLI boolean like allow_trust_remote_code or trust_remote_code)
defaulting to False, then pass trust_remote_code only when that flag is true;
update the code path that calls load_dataset (reference: load_dataset and
dataset_id in contamination.py) to consult the new flag and ensure users must
explicitly enable it before remote code is allowed.

In `@openverifiablellm/utils.py`:
- Around line 131-136: The manifest generation currently omits contamination
metadata when benchmarks_used is None, making redacted runs indistinguishable
from unchecked runs; update the calls to generate_manifest (and the
generate_manifest implementation if needed) so that when contamination checking
is enabled you always pass a structured contamination/config/result block (e.g.,
a contamination or contamination_metadata argument) even if benchmarks_used is
None, and ensure generate_manifest writes that block to the manifest; apply the
same change to the other call sites noted (the call around lines 169-171) so the
manifest always contains the contamination check config and outcome when
checking is performed.
- Around line 86-90: Current lazy-import logic silently disables contamination
checking when only one of bloom_filter or benchmark_texts is provided; update
the block that sets check_contamination to validate configuration and fail fast:
if exactly one of bloom_filter or benchmark_texts is None (i.e., (bloom_filter
is None) != (benchmark_texts is None)), raise a clear configuration error
(ValueError or a project-specific ConfigurationError) stating both must be
supplied to enable contamination checking, otherwise perform the import from
openverifiablellm.contamination and set check_contamination as before.

In `@pyproject.toml`:
- Around line 14-19: Remove "datasets", "pyyaml", and "rbloom" from the
top-level dependencies array in pyproject.toml and add them under an optional
extra named "contamination" (e.g., project.optional-dependencies.contamination
or tool.poetry.extras.contamination depending on your pyproject format); keep
"defusedxml" in core deps. This aligns with the lazy-imports used around
check_contamination and the conditional use of bloom_filter and benchmark_texts
so users who never enable contamination won't install those packages. Ensure the
extra lists the same package names (add version pins if project uses them) and
run a quick packaging/installation test to confirm optional extra installs when
requested.

In `@tests/test_contamination.py`:
- Around line 96-103: The variable bloom1 in test_serialisation_roundtrip is
unused and triggers an F841 lint error; remove the unused binding by replacing
the standalone build_bloom_filter call with either an assertion on its
side-effect (e.g. confirm the file was created or that subsequent
build_bloom_filter(read) returns expected state) or simply call
build_bloom_filter(texts, config) without assigning to bloom1. Update references
in the test to use bloom2 and the existing get_ngrams loop; ensure any necessary
side-effect checks reference the build_bloom_filter invocation (use file
existence or config-backed state) rather than an unused bloom1 variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: af5c1849-6dbc-409e-a594-04eedda3423e

📥 Commits

Reviewing files that changed from the base of the PR and between 35ce9dc and 0652703.

📒 Files selected for processing (6)
  • openverifiablellm/config.py
  • openverifiablellm/contamination.py
  • openverifiablellm/utils.py
  • pyproject.toml
  • tests/test_contamination.py
  • tests/test_util.py


Archit381 commented Mar 10, 2026

@DhruvK278 Pull from main, fix the conflicts and check again. We are using uv to package now so test if there's any issue

Introduce contamination detection pipeline and configuration handling. Adds openverifiablellm/config.py to load benchmark lists (CLI, YAML, defaults) and BenchmarkConfig dataclass. Adds openverifiablellm/contamination.py implementing n-gram generation, benchmark loaders (HF and local JSONL/CSV), Bloom filter construction/serialization (rbloom) and a two-stage contamination check (bloom + exact verification). Update openverifiablellm/utils.py to: adjust compute_sha256/compute_merkle_root call signatures, integrate optional contamination checking into extract_text_from_xml, track redacted chunks, and extend generate_manifest to include contamination metadata and merkle fields. Update pyproject.toml to add runtime deps: datasets, pyyaml, rbloom. Add comprehensive tests (tests/test_contamination.py) for n-grams, bloom filter behavior, contamination detection, manifest fields and backward compatibility; adjust tests/test_util.py to match renamed compute_sha256 parameter. Overall goal: enable detection and recording of dataset contamination when preprocessing large corpora.
@DhruvK278 DhruvK278 force-pushed the feat/data-contamination-detection branch from 0652703 to 3b8746a Compare March 10, 2026 19:17
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 10, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_util.py (1)

89-104: 🧹 Nitpick | 🔵 Trivial

Drop the duplicated hash assertions.

These lines exercise the exact same path twice and do not add coverage. Keeping a single call/assert per case will make the tests easier to read and less noisy when they fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_util.py` around lines 89 - 104, Remove the duplicated redundant
calls/assertions in the tests that call utils.compute_sha256 twice for the same
inputs: in the first test remove the repeated assignment to actual (duplicate
compute_sha256 call) and in test_different_content_different_hash remove the
second identical assert; keep a single compute_sha256 invocation and a single
assert per case to avoid duplicate coverage and noisy failures (references:
utils.compute_sha256, test_different_content_different_hash).
♻️ Duplicate comments (3)
openverifiablellm/contamination.py (2)

153-158: ⚠️ Potential issue | 🟠 Major

Invalidate the Bloom cache when the inputs change.

The shared filter_path is reused solely on file existence. Switching benchmark sources, config.n, or even normalization logic can therefore load a stale filter and silently misclassify chunks. Persist cache metadata alongside the Bloom bytes and rebuild when it no longer matches the current inputs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/contamination.py` around lines 153 - 158, The Bloom cache
is being loaded solely by file existence (filter_path) which can return stale
data when inputs change; update the logic around Bloom.load_bytes and
_bloom_hash to persist and validate cache metadata (e.g., source id, config.n,
normalization/version hash) alongside the raw bytes and, on load, compare the
stored metadata against current inputs and normalization parameters and rebuild
the Bloom (recompute via the code path that creates a new Bloom) if any mismatch
is detected; ensure the metadata is written whenever creating/updating the cache
so subsequent runs can validate before returning the cached Bloom.

64-69: ⚠️ Potential issue | 🔴 Critical

Do not execute arbitrary HF dataset code or fail open on load errors.

BenchmarkConfig.benchmarks is populated from CLI/YAML input, so load_dataset(dataset_id, trust_remote_code=True) lets untrusted dataset IDs run remote code. The broad except then converts auth/network/schema failures into [], which silently disables contamination checking for that source. Default trust_remote_code to False and re-raise after logging instead of returning an empty corpus.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/contamination.py` around lines 64 - 69, The code currently
calls load_dataset(dataset_id, trust_remote_code=True) and swallows all
exceptions returning an empty texts list; change the call to
load_dataset(dataset_id, trust_remote_code=False) to avoid running untrusted
remote code, and after catching Exception in the load_dataset block use
logger.warning("Could not load HF dataset '%s': %s", dataset_id, exc) (or
similar) but re-raise the exception instead of returning texts so callers know
loading failed; keep the texts variable initialization and only return it in the
normal success path (i.e., remove the early return in the except block) to
ensure contamination checks do not silently skip on load errors.
pyproject.toml (1)

14-19: ⚠️ Potential issue | 🔴 Critical

Restore tokenizers to the runtime dependencies.

openverifiablellm/tokenizer/bpe_tokenizer.py directly imports ByteLevelBPETokenizer from the tokenizers package, and openverifiablellm/tokenizer/factory.py instantiates BPETokenizer when tokenizer_type == "bpe". Removing tokenizers from project.dependencies will cause an ImportError at runtime when the BPE tokenizer path is used.

📦 Minimal fix
 dependencies= [
     "defusedxml",
+    "tokenizers==0.15.2",
     "datasets",
     "pyyaml",
     "rbloom"
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 14 - 19, The pyproject.toml removed the runtime
dependency on tokenizers but openverifiablellm/tokenizer/bpe_tokenizer.py
imports ByteLevelBPETokenizer and openverifiablellm/tokenizer/factory.py may
instantiate BPETokenizer when tokenizer_type == "bpe", causing ImportError at
runtime; restore "tokenizers" in the dependencies list in pyproject.toml (add
"tokenizers" to the dependencies array) so ByteLevelBPETokenizer and
BPETokenizer can be imported/instantiated at runtime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 113-128: get_ngrams currently allows non-positive n which leads to
producing empty-string n-grams; modify get_ngrams to validate its n argument and
raise a ValueError for n <= 0. In the get_ngrams function add a guard at the
start (before normalization/tokenisation) that checks if n is an integer > 0 and
raises ValueError("n must be a positive integer") or similar to prevent
inserting empty n-grams into downstream functions like build_bloom_filter and
check_contamination.

In `@openverifiablellm/utils.py`:
- Around line 40-46: In the Merkle leaf construction and empty-case return,
remove the duplicated calls to compute_sha256 so each chunk is hashed only once
and the empty case returns a single compute_sha256 result; specifically, in the
block that sets leaf_hex (currently calling compute_sha256 twice) keep only one
call to compute_sha256 when assigning leaf_hex and appending
bytes.fromhex(leaf_hex) to leaves, and in the empty-list branch ensure only one
return compute_sha256(data=b"") remains; references to look for: leaf_hex,
compute_sha256, leaves.
- Around line 295-299: The manifest currently stores only benchmarks_used and
redacted_chunks which is insufficient to reproduce the contamination snapshot;
update the code that builds manifest (where
manifest["contamination_checks_passed"] = benchmarks_used and
manifest["redacted_chunks"] = redacted_chunks) to persist per-source snapshot
metadata: for each entry in benchmarks_used record dataset identifier, dataset
config/split, exact version or commit/tag (for HF datasets use dataset
build/version or datestamp), local file path if used, and a content digest
(e.g., sha256) of the dataset file or snapshot; also attach redaction details
per-source (e.g., redacted_chunks keyed by source) so the manifest contains
source->(version, path, hash, redacted_chunks) enabling exact reproduction.
- Around line 162-168: The module entrypoint still calls extract_text_from_xml
with a write_manifest argument but the function signature removed it, causing a
TypeError; update extract_text_from_xml to accept write_manifest (e.g., add
write_manifest=False to the parameter list) or add **kwargs to the signature so
the CLI call no longer fails, and ensure any write_manifest behavior (if needed)
is handled or ignored inside extract_text_from_xml; reference the
extract_text_from_xml function to locate and change its signature accordingly.

---

Outside diff comments:
In `@tests/test_util.py`:
- Around line 89-104: Remove the duplicated redundant calls/assertions in the
tests that call utils.compute_sha256 twice for the same inputs: in the first
test remove the repeated assignment to actual (duplicate compute_sha256 call)
and in test_different_content_different_hash remove the second identical assert;
keep a single compute_sha256 invocation and a single assert per case to avoid
duplicate coverage and noisy failures (references: utils.compute_sha256,
test_different_content_different_hash).

---

Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 153-158: The Bloom cache is being loaded solely by file existence
(filter_path) which can return stale data when inputs change; update the logic
around Bloom.load_bytes and _bloom_hash to persist and validate cache metadata
(e.g., source id, config.n, normalization/version hash) alongside the raw bytes
and, on load, compare the stored metadata against current inputs and
normalization parameters and rebuild the Bloom (recompute via the code path that
creates a new Bloom) if any mismatch is detected; ensure the metadata is written
whenever creating/updating the cache so subsequent runs can validate before
returning the cached Bloom.
- Around line 64-69: The code currently calls load_dataset(dataset_id,
trust_remote_code=True) and swallows all exceptions returning an empty texts
list; change the call to load_dataset(dataset_id, trust_remote_code=False) to
avoid running untrusted remote code, and after catching Exception in the
load_dataset block use logger.warning("Could not load HF dataset '%s': %s",
dataset_id, exc) (or similar) but re-raise the exception instead of returning
texts so callers know loading failed; keep the texts variable initialization and
only return it in the normal success path (i.e., remove the early return in the
except block) to ensure contamination checks do not silently skip on load
errors.

In `@pyproject.toml`:
- Around line 14-19: The pyproject.toml removed the runtime dependency on
tokenizers but openverifiablellm/tokenizer/bpe_tokenizer.py imports
ByteLevelBPETokenizer and openverifiablellm/tokenizer/factory.py may instantiate
BPETokenizer when tokenizer_type == "bpe", causing ImportError at runtime;
restore "tokenizers" in the dependencies list in pyproject.toml (add
"tokenizers" to the dependencies array) so ByteLevelBPETokenizer and
BPETokenizer can be imported/instantiated at runtime.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0703ff93-109e-4739-8fed-c98c72ea29bc

📥 Commits

Reviewing files that changed from the base of the PR and between 0652703 and 3b8746a.

📒 Files selected for processing (6)
  • openverifiablellm/config.py
  • openverifiablellm/contamination.py
  • openverifiablellm/utils.py
  • pyproject.toml
  • tests/test_contamination.py
  • tests/test_util.py

Comment on lines +295 to +299
# --- contamination metadata ---
if benchmarks_used is not None:
manifest["contamination_checks_passed"] = benchmarks_used
manifest["redacted_chunks"] = redacted_chunks


⚠️ Potential issue | 🟠 Major

Persist benchmark snapshot metadata, not just names.

["gsm8k", "cais/mmlu"] plus redacted_chunks is not enough to reproduce the contamination corpus later. HF datasets can move, and local files need at least a path/hash. Without per-source revision or content digest data, the manifest cannot prove which benchmark snapshot was actually checked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 295 - 299, The manifest currently
stores only benchmarks_used and redacted_chunks which is insufficient to
reproduce the contamination snapshot; update the code that builds manifest
(where manifest["contamination_checks_passed"] = benchmarks_used and
manifest["redacted_chunks"] = redacted_chunks) to persist per-source snapshot
metadata: for each entry in benchmarks_used record dataset identifier, dataset
config/split, exact version or commit/tag (for HF datasets use dataset
build/version or datestamp), local file path if used, and a content digest
(e.g., sha256) of the dataset file or snapshot; also attach redaction details
per-source (e.g., redacted_chunks keyed by source) so the manifest contains
source->(version, path, hash, redacted_chunks) enabling exact reproduction.
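A sketch of the per-source manifest entry that comment describes (the entry shape and function name are illustrative, not the PR's code):

```python
import hashlib
from pathlib import Path

def describe_benchmark_source(source, split=None, revision=None):
    # Reproducible manifest entry: identifier plus, for local files,
    # the exact path and content digest of the bytes that were checked.
    entry = {"id": source, "split": split, "revision": revision}
    p = Path(source)
    if p.is_file():  # local JSONL/CSV benchmark
        entry["path"] = str(p)
        entry["sha256"] = hashlib.sha256(p.read_bytes()).hexdigest()
    return entry
```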

@DhruvK278 (Author) commented

Not the final push. Adding the required changes

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
openverifiablellm/contamination.py (1)

118-133: ⚠️ Potential issue | 🟠 Major

Reject non-positive n-gram sizes.

n <= 0 still generates empty-string n-grams here. Once build_bloom_filter() inserts "", check_contamination() can start flagging unrelated text because "" in any_string is always true. Guard this before normalisation/tokenisation.

Proposed fix
 def get_ngrams(text: str, n: int = 13) -> List[str]:
@@
-    normalised = _normalise(text)
+    if n <= 0:
+        raise ValueError("n must be a positive integer")
+
+    normalised = _normalise(text)
     tokens = normalised.split()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/contamination.py` around lines 118 - 133, get_ngrams
currently accepts n <= 0 and will generate empty-string n-grams which then
pollute bloom filters; add an explicit guard at the top of get_ngrams to reject
non-positive n (e.g. raise ValueError when n <= 0) before any
normalization/tokenisation, so build_bloom_filter and check_contamination never
see an empty-string n-gram; reference the get_ngrams function and ensure callers
(e.g. build_bloom_filter / check_contamination paths) will receive the error
rather than inserting "" into the filter.
openverifiablellm/utils.py (2)

40-46: 🧹 Nitpick | 🔵 Trivial

Remove the repeated SHA-256 work in Merkle construction.

Lines 40/41 and 55/56 hash the same bytes twice, and Line 46 is unreachable after the first return. The result is unchanged, but large dumps now pay the hashing cost twice for every leaf and parent.

Proposed fix
         while chunk := f.read(chunk_size):
             # reuse compute_sha256
             leaf_hex = compute_sha256(data=chunk)
-            leaf_hex = compute_sha256(data=chunk)
             leaves.append(bytes.fromhex(leaf_hex))
@@
     if not leaves:
         return compute_sha256(data=b"")
-        return compute_sha256(data=b"")
@@
             combined = left + right
             parent_hex = compute_sha256(data=combined)
-            parent_hex = compute_sha256(data=combined)
             next_level.append(bytes.fromhex(parent_hex))

Also applies to: 55-56

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 40 - 46, The Merkle construction is
performing duplicate SHA-256 work and has an unreachable return: call
compute_sha256 exactly once per input chunk (replace the two identical calls
that produce leaf_hex with a single call and append bytes.fromhex(leaf_hex)
once), and remove the duplicated return so the empty-tree case returns
compute_sha256(data=b"") only once; apply the same deduping fix where parent
hashes are computed (ensure each parent node is hashed a single time via
compute_sha256 and no duplicated calls remain).
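A deduplicated version of the Merkle construction could look like the sketch below, under the assumption that `compute_sha256` returns a hex digest; each chunk and each parent is hashed exactly once, and the empty-input case returns a single hash of `b""`.

```python
import hashlib
import io

def compute_sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(f, chunk_size: int = 4096) -> str:
    leaves = []
    while chunk := f.read(chunk_size):
        # Hash each chunk exactly once.
        leaves.append(bytes.fromhex(compute_sha256(data=chunk)))
    if not leaves:
        return compute_sha256(data=b"")
    level = leaves
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            left = level[i]
            # Duplicate the last node when the level has an odd length.
            right = level[i + 1] if i + 1 < len(level) else left
            next_level.append(bytes.fromhex(compute_sha256(data=left + right)))
        level = next_level
    return level[0].hex()
```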

263-270: ⚠️ Potential issue | 🟠 Major

Persist benchmark snapshot metadata, not just names.

The manifest payload built here only carries benchmarks_used and redacted_chunks. That is still not enough to reproduce which benchmark snapshot was checked later, especially for mutable Hugging Face datasets or local files. Please include per-source revision/path/content digest in contamination_metadata.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openverifiablellm/utils.py` around lines 263 - 270, The current
contamination_metadata only stores benchmark names; change it to persist
per-benchmark snapshot metadata (not just names) so the check is reproducible:
when building contamination_metadata in the check_contamination branch, replace
benchmarks_used list of strings with a list of dicts containing unique snapshot
fields (e.g., "name"/"id", "source" (hf/local), "path" or "config",
"revision"/"commit" if available, and a content_digest/hash for the snapshot or
file); compute content_digest for local files (e.g., sha256 of file bytes) and
capture dataset revision/sha or dataset fingerprint for HuggingFace datasets,
and ensure redacted_chunks remains included; update any callers that expect
benchmarks_used to handle the new list-of-dicts structure (look for uses of
contamination_metadata and the variable benchmarks_used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/config.py`:
- Around line 50-55: The parsing currently accepts an empty benchmarks list or
entries that are only whitespace/comma, leading to an empty effective filter;
update the parsing of the "benchmarks" value (the local variable benchmarks in
openverifiablellm/config.py) to strip and filter out empty strings (e.g.,
split/trim multi-value CLI input and remove items where not b.strip()) and
require that at least one non-empty benchmark remains; if none remain, either
raise a ValueError or fall back to the default benchmark list. Apply the same
validation/fallback logic to the other benchmarks-handling block referenced
around lines 71-76 so both code paths reject/handle empty selections
consistently.
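The validation/fallback logic for the benchmarks list could be sketched like this (the function name is illustrative; the defaults match the PR description's `["gsm8k", "cais/mmlu"]`):

```python
from typing import List

DEFAULT_BENCHMARKS = ["gsm8k", "cais/mmlu"]

def parse_benchmarks(raw: str) -> List[str]:
    # Split comma-separated CLI input and drop whitespace-only entries.
    benchmarks = [b.strip() for b in raw.split(",") if b.strip()]
    # Fall back to the defaults rather than silently running with an
    # empty effective filter (raising ValueError is the other option).
    return benchmarks or list(DEFAULT_BENCHMARKS)
```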

In `@openverifiablellm/contamination.py`:
- Around line 39-45: The loop over config.benchmarks currently falls back to
_load_hf when a path has a local suffix but the file is missing; change the
logic in the for benchmark in config.benchmarks loop to treat a benchmark that
ends with .jsonl or .csv as a required local file: use Path(benchmark) and if
path.suffix in (".jsonl", ".csv") then if not path.is_file() raise a clear
FileNotFoundError (or ValueError) referencing the benchmark string instead of
calling _load_hf, otherwise call _load_local(path); only call _load_hf for
entries that do not have a local suffix.
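The dispatch rule the comment describes could be sketched as below; `_load_local` and `_load_hf` are stand-ins for the real loaders so the branching is visible on its own.

```python
from pathlib import Path

def _load_local(path):  # stand-in for the real local-file loader
    return ("local", str(path))

def _load_hf(name):  # stand-in for the real Hugging Face loader
    return ("hf", name)

def load_benchmark(benchmark: str):
    path = Path(benchmark)
    if path.suffix in (".jsonl", ".csv"):
        # A local suffix makes the file mandatory: fail loudly instead of
        # silently falling back to a Hugging Face lookup.
        if not path.is_file():
            raise FileNotFoundError(f"benchmark file not found: {benchmark}")
        return _load_local(path)
    return _load_hf(benchmark)
```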
- Around line 243-250: The current exact-verification step uses substring
matching ("if ngram in nb") which allows false positives across token
boundaries; change it to compare whole-token n-grams by tokenizing each
normalised benchmark (use the same tokenization used to build
`ngrams`/`_normalise`) and check whether any contiguous token window equals the
`ngram` (or alternatively pad both `nb` and `ngram` with explicit boundary
markers before doing `in`); update the loop in the verification code that
iterates over `ngrams`, `normalised_benchmarks`, and `nb` to perform token-level
comparison (e.g., split `nb` into tokens and compare
`" ".join(tokens[i:i+len_ngram]) == ngram`) so matches respect token boundaries.
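A token-boundary-safe verification step could look like the sketch below; the test also demonstrates the substring false positive the comment warns about.

```python
def exact_ngram_match(ngram: str, normalised_benchmark: str) -> bool:
    # Compare whole-token windows, so a match cannot start or end in the
    # middle of a benchmark token the way a plain substring check can.
    needle = ngram.split()
    tokens = normalised_benchmark.split()
    k = len(needle)
    return any(tokens[i:i + k] == needle for i in range(len(tokens) - k + 1))
```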

---

Duplicate comments:
In `@openverifiablellm/contamination.py`:
- Around line 118-133: get_ngrams currently accepts n <= 0 and will generate
empty-string n-grams which then pollute bloom filters; add an explicit guard at
the top of get_ngrams to reject non-positive n (e.g. raise ValueError when n <=
0) before any normalization/tokenisation, so build_bloom_filter and
check_contamination never see an empty-string n-gram; reference the get_ngrams
function and ensure callers (e.g. build_bloom_filter / check_contamination
paths) will receive the error rather than inserting "" into the filter.

In `@openverifiablellm/utils.py`:
- Around line 40-46: The Merkle construction is performing duplicate SHA-256
work and has an unreachable return: call compute_sha256 exactly once per input
chunk (replace the two identical calls that produce leaf_hex with a single call
and append bytes.fromhex(leaf_hex) once), and remove the duplicated return so
the empty-tree case returns compute_sha256(data=b"") only once; apply the same
deduping fix where parent hashes are computed (ensure each parent node is hashed
a single time via compute_sha256 and no duplicated calls remain).
- Around line 263-270: The current contamination_metadata only stores benchmark
names; change it to persist per-benchmark snapshot metadata (not just names) so
the check is reproducible: when building contamination_metadata in the
check_contamination branch, replace benchmarks_used list of strings with a list
of dicts containing unique snapshot fields (e.g., "name"/"id", "source"
(hf/local), "path" or "config", "revision"/"commit" if available, and a
content_digest/hash for the snapshot or file); compute content_digest for local
files (e.g., sha256 of file bytes) and capture dataset revision/sha or dataset
fingerprint for HuggingFace datasets, and ensure redacted_chunks remains
included; update any callers that expect benchmarks_used to handle the new
list-of-dicts structure (look for uses of contamination_metadata and the
variable benchmarks_used).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e6d5f329-2beb-4897-a11c-e00dd2ddaefe

📥 Commits

Reviewing files that changed from the base of the PR and between 3b8746a and d81e9fd.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • openverifiablellm/config.py
  • openverifiablellm/contamination.py
  • openverifiablellm/utils.py
  • pyproject.toml
  • tests/test_contamination.py

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 10, 2026
@DhruvK278
Author

@Archit381 Please review

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openverifiablellm/contamination.py`:
- Around line 251-259: The hot path inside check_contamination rebuilds
normalised_benchmarks on every Bloom hit (see normalised_benchmarks, _normalise,
benchmark_texts), causing repeated O(number of benchmarks) work; precompute the
set of normalized benchmark texts once (e.g., in the preprocessing entry point
that calls extract_text_from_xml) and change check_contamination to accept a
precomputed_normalised_benchmarks parameter instead of constructing it inline,
then use that set for the inner exact-match loop to avoid rebuilding it on each
invocation.
- Around line 164-167: The current inputs_hash is computed by concatenating raw
benchmark_texts bytes which can collide across different segmentations; update
the hashing in the function that computes inputs_hash (use the hasher variable
near the loop over benchmark_texts) to include an unambiguous delimiter or
length-prefix for each text (e.g., write each text length then the text or
serialize each entry with a structured encoding) before hasher.update, so the
resulting inputs_hash cannot be forged by different concatenations of the same
bytes.
- Around line 72-84: The current load in contamination.py uses
load_dataset(dataset_id, trust_remote_code=trust_remote_code) which returns all
splits and pulls non-evaluation data into texts; change this to load only
evaluation splits (e.g., pass split="test" or "validation" as appropriate) or
make the split selection configurable via a new parameter (e.g.,
evaluation_splits or split_map) so that GSM8K uses "test" and MMLU uses
"test"/"validation"; adjust the call site that constructs ds (still handling
exceptions and raising DatasetLoadError) and then iterate over only those chosen
splits to populate texts (respecting _TEXT_FIELDS and the existing texts append
logic).
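The configurable split selection could be sketched as below. The split map values are assumptions about each benchmark's published evaluation split, and the loader is injected (in production it would be `datasets.load_dataset`) so the selection logic is testable without network access.

```python
from typing import Callable, Dict, List, Optional

DEFAULT_SPLITS: Dict[str, str] = {
    # Assumed mapping; adjust to each benchmark's actual eval split.
    "gsm8k": "test",
    "cais/mmlu": "test",
}

def eval_split_for(dataset_id: str,
                   split_map: Optional[Dict[str, str]] = None) -> str:
    merged = {**DEFAULT_SPLITS, **(split_map or {})}
    return merged.get(dataset_id, "test")

def load_eval_texts(dataset_id: str, loader: Callable,
                    split_map: Optional[Dict[str, str]] = None) -> List[str]:
    # `loader` stands in for datasets.load_dataset; only the chosen
    # evaluation split is loaded, never every split.
    split = eval_split_for(dataset_id, split_map)
    ds = loader(dataset_id, split=split)
    return [row.get("question") or row.get("text", "") for row in ds]
```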

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 674750de-288e-4aa4-83eb-9d730e04de24

📥 Commits

Reviewing files that changed from the base of the PR and between 15c6e2a and 0123902.

📒 Files selected for processing (3)
  • openverifiablellm/config.py
  • openverifiablellm/contamination.py
  • tests/test_contamination.py

@Archit381
Member

fix lint issues and resolve conflicts

@DhruvK278 DhruvK278 force-pushed the feat/data-contamination-detection branch from fd7475d to 15c6e2a Compare March 12, 2026 18:52
@DhruvK278
Author

fix lint issues and resolve conflicts

Done!!

@DhruvK278 DhruvK278 force-pushed the feat/data-contamination-detection branch from 9e5638f to ebe7eb5 Compare March 13, 2026 09:32
@DhruvK278
Author

DhruvK278 commented Mar 14, 2026

@Archit381 Please review. I have fixed the conflict and lint issue


Development

Successfully merging this pull request may close these issues.

[FEATURE]: Data Contamination Detection via Configurable Benchmark Filtering