Implement Manifest Chain Tamper Detection for cryptographically linked dataset manifests #53
aniket866 wants to merge 10 commits into AOSSIE-Org:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. Use the following commands to manage reviews.
Walkthrough
Adds a tamper-evident manifest chain.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Preprocessing
    participant ManifestChain
    participant FileSystem
    participant Verification
    User->>Preprocessing: run preprocessing
    Preprocessing->>FileSystem: read existing `dataset_manifest.json`?
    FileSystem-->>Preprocessing: manifest JSON or not found
    Preprocessing->>ManifestChain: compute_manifest_hash(previous_manifest)
    ManifestChain->>FileSystem: read previous manifest (if present)
    FileSystem-->>ManifestChain: previous manifest JSON
    ManifestChain->>ManifestChain: canonicalize JSON -> SHA-256
    ManifestChain-->>Preprocessing: parent_manifest_hash
    Preprocessing->>FileSystem: write new `dataset_manifest.json` (includes parent_manifest_hash)
    User->>Verification: run verification (--previous-manifest optional)
    Verification->>ManifestChain: verify_manifest_chain or verify_manifest_chain_link
    ManifestChain->>FileSystem: read current and previous manifests as needed
    FileSystem-->>ManifestChain: manifest JSONs
    ManifestChain->>ManifestChain: compute previous hash, compare to stored parent_manifest_hash
    ManifestChain-->>Verification: chain valid / chain tampered / report details
    Verification-->>User: verification report (includes chain check)
```
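The linking step in the diagram can be sketched in a few lines. This is a simplified stand-in, not the project's actual code: the helper names and manifest fields here are assumptions, and the real logic lives in `openverifiablellm/manifest_chain.py`.

```python
import hashlib
import json


def canonical_json(data: dict) -> str:
    # Deterministic serialization: sorted keys, fixed separators, so the
    # same manifest always hashes to the same digest.
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def compute_manifest_hash(manifest: dict) -> str:
    # SHA-256 over the canonical JSON form of the whole manifest.
    return hashlib.sha256(canonical_json(manifest).encode("utf-8")).hexdigest()


# Chaining: each new manifest records the hash of its predecessor.
previous = {"files": {"wiki_clean.txt": "abc123"}, "parent_manifest_hash": ""}
child = {
    "files": {"wiki_clean.txt": "def456"},
    "parent_manifest_hash": compute_manifest_hash(previous),
}
```

Because the serialization is canonical, any byte-level edit to `previous` changes the digest that `child` recorded, which is what makes the chain tamper-evident.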
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-96: The hashing routine currently removes
"parent_manifest_hash" before canonicalizing and hashing; include the full
manifest (do not pop "parent_manifest_hash") so the manifest hash covers the
parent link (adjust the code around the variable hashable and the call to
_canonical_json to stop removing that key), and update the corresponding
assertion in tests/test_manifest_chain.py to expect that changing a manifest's
parent_manifest_hash will break/alter the child's verification (flip the test
assertion accordingly).
- Around line 253-266: The branch that handles an explicit
previous_manifest_path calls verify_manifest_chain_link(previous_manifest_path,
manifest_data) but lets exceptions bubble up; change verify_manifest_chain() to
catch exceptions from verify_manifest_chain_link and return a structured report
instead of raising: call verify_manifest_chain_link in a try/except, set
chain_valid to False on exception, set chain_message to a clear message
including the exception text (e.g., "Chain link broken — could not read or
verify previous manifest: <error>"), and still return has_parent_hash_field and
parent_hash_value alongside chain_valid and chain_message so this exit path
matches the other structured-report returns.
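The structured-report behavior this comment asks for can be sketched as follows. The function bodies here are simplified stand-ins under assumed names, not the real implementations in `openverifiablellm/manifest_chain.py`:

```python
import json
from pathlib import Path


def verify_manifest_chain_link(previous_path, manifest):
    # Stand-in: reading a missing file raises OSError, as the real helper can.
    previous = json.loads(Path(previous_path).read_text())
    if manifest.get("parent_manifest_hash") != previous.get("self_hash"):
        raise ValueError("parent hash mismatch")


def verify_manifest_chain(manifest, previous_manifest_path=None):
    report = {
        "has_parent_hash_field": "parent_manifest_hash" in manifest,
        "parent_hash_value": manifest.get("parent_manifest_hash"),
    }
    if previous_manifest_path is not None:
        try:
            verify_manifest_chain_link(previous_manifest_path, manifest)
            report.update(chain_valid=True, chain_message="Chain link verified")
        except Exception as exc:  # return a structured report instead of raising
            report.update(
                chain_valid=False,
                chain_message=(
                    "Chain link broken: could not read or verify "
                    f"previous manifest: {exc}"
                ),
            )
    return report


report = verify_manifest_chain(
    {"parent_manifest_hash": "abc"}, "no_such_dir/manifest_v1.json"
)
```

With this shape, an unreadable previous manifest produces `chain_valid: False` plus a message carrying the exception text, matching the other structured-report exit paths.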
In `@openverifiablellm/verify.py`:
- Around line 267-300: The branch currently treats a provided but missing
previous_manifest_path as if no previous was requested; change the logic so that
when previous_manifest_path is not None but .exists() is False you immediately
add a CheckResult named "manifest_chain_link" with status CheckStatus.FAIL and a
clear detail like "previous manifest provided but cannot be opened/missing",
rather than falling through to the manifest_chain_awareness path; keep the
existing try/except around verify_manifest_chain_link(previous_manifest_path,
manifest) for the exists() True case, and leave
verify_manifest_chain(manifest_path) only for the explicit None case; add a
unit/regression test covering previous_manifest_path provided but missing that
expects manifest_chain_link to be FAIL.
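The requested fail-fast branch can be sketched like this. The `CheckResult`/`CheckStatus` shapes are assumed from the comment's wording, not copied from the actual `verify.py`:

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class CheckStatus(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass
class CheckResult:
    name: str
    status: CheckStatus
    detail: str


def chain_link_check(previous_manifest_path):
    # A provided-but-missing path must fail loudly instead of silently
    # degrading to the chain-awareness path.
    if previous_manifest_path is not None and not Path(previous_manifest_path).exists():
        return CheckResult(
            "manifest_chain_link",
            CheckStatus.FAIL,
            "previous manifest provided but cannot be opened/missing",
        )
    return None  # fall through to the existing verification branches


result = chain_link_check("no_such_dir/manifest_v1.json")
```

An explicit `None` (no previous manifest requested) falls through, while a dangling path is reported as a failed `manifest_chain_link` check.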
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: dbe369c3-eae0-49d4-a310-c6231ca347b1
📒 Files selected for processing (4)
openverifiablellm/manifest_chain.py
openverifiablellm/utils.py
openverifiablellm/verify.py
tests/test_manifest_chain.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
openverifiablellm/manifest_chain.py (1)
90-96: ⚠️ Potential issue | 🟠 Major
Hash the full previous manifest, including its parent link.
`parent_manifest_hash` is not self-referential here; it authenticates the previous manifest. Dropping it means a child still verifies after someone edits only the previous manifest's own parent pointer, so ancestry tampering remains invisible.
Suggested fix:

```diff
-    # Create a copy and remove parent_manifest_hash before hashing
-    # (so the hash doesn't include itself)
-    hashable = manifest_data.copy()
-    hashable.pop("parent_manifest_hash", None)
-
-    canonical = _canonical_json(hashable)
+    canonical = _canonical_json(manifest_data)
     return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/manifest_chain.py` around lines 90 - 96, The code is removing parent_manifest_hash before computing the manifest hash, which omits the parent's link and allows ancestry tampering; stop popping "parent_manifest_hash" so the hash includes the full previous manifest link — i.e., compute the canonical JSON from manifest_data (or from the "hashable" copy without removing parent_manifest_hash) and then call _canonical_json(...) and hashlib.sha256(...) on that result so the resulting hexdigest incorporates parent_manifest_hash; locate the block that creates "hashable = manifest_data.copy()", the pop("parent_manifest_hash", None) call, and replace it by leaving parent_manifest_hash intact before calling _canonical_json and hashlib.sha256.
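The failure mode described above can be reproduced in a few lines. This uses a simplified stand-in for the project's canonicalization, purely to illustrate why popping the parent pointer before hashing hides ancestry tampering:

```python
import hashlib
import json


def sha256_of(data: dict) -> str:
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()


def hash_without_parent(data: dict) -> str:
    # The flawed variant: drop parent_manifest_hash before hashing.
    return sha256_of({k: v for k, v in data.items() if k != "parent_manifest_hash"})


previous = {"data": "v1", "parent_manifest_hash": "aaaa"}
stored_link = hash_without_parent(previous)  # what the child records

# Rewrite only the previous manifest's own parent pointer:
previous["parent_manifest_hash"] = "bbbb"
```

The flawed hash is unchanged by the rewrite, so the child still verifies; hashing the full manifest would catch the edit.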
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 272-282: The current branch in manifest_chain.py sets
chain_valid=True when parent_manifest_hash exists and is non-empty, which lets
unverified/non-root manifests pass; change the logic so that when has_field is
True and parent_hash_value is non-empty you do NOT mark chain_valid=True by
default—instead set chain_valid=False and a message like "parent_manifest_hash
present but previous_manifest_path not provided (cannot verify non-root
manifest)"; only set chain_valid=True for the empty parent_hash_value (root) or
after explicit verification of the previous manifest when previous_manifest_path
(and its verification routine) is present and succeeds. Reference the
variables/logic handling has_field, parent_hash_value, chain_valid and
previous_manifest_path or the verification function to locate and update the
code.
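The decision table implied by this comment can be sketched as a small pure function. Names and message wording are assumptions based on the review text, not the module's actual API:

```python
def classify_chain(has_field, parent_hash_value, previous_manifest_path):
    """Decide chain_valid before any link verification runs (sketch only)."""
    if not has_field:
        return False, "parent_manifest_hash field missing"
    if parent_hash_value == "":
        return True, "root manifest (empty parent_manifest_hash)"
    if previous_manifest_path is None:
        return False, ("parent_manifest_hash present but previous_manifest_path "
                       "not provided (cannot verify non-root manifest)")
    # Non-root with a previous manifest supplied: defer to the explicit
    # link verification; True here assumes that check succeeded.
    return True, "verified via explicit link check"
```

Only the empty-parent (root) case and the explicitly verified case return `True`; a non-root manifest with nothing to verify against is no longer waved through.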
---
Duplicate comments:
In `@openverifiablellm/manifest_chain.py`:
- Around line 90-96: The code is removing parent_manifest_hash before computing
the manifest hash, which omits the parent's link and allows ancestry tampering;
stop popping "parent_manifest_hash" so the hash includes the full previous
manifest link — i.e., compute the canonical JSON from manifest_data (or from the
"hashable" copy without removing parent_manifest_hash) and then call
_canonical_json(...) and hashlib.sha256(...) on that result so the resulting
hexdigest incorporates parent_manifest_hash; locate the block that creates
"hashable = manifest_data.copy()", the pop("parent_manifest_hash", None) call,
and replace it by leaving parent_manifest_hash intact before calling
_canonical_json and hashlib.sha256.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6a8c0e79-a83b-480b-9d53-0b43f2724ad8
📒 Files selected for processing (1)
openverifiablellm/manifest_chain.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
openverifiablellm/utils.py (3)
1-13: ⚠️ Potential issue | 🟡 Minor
Sort this import block to unblock CI.
Ruff is already failing on this file’s import ordering, so this needs to be normalized before merge.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 1 - 13, The import block in utils.py is not sorted per the project's style (Ruff/isort), causing CI failures; reorder the imports into standard library first (bz2, hashlib, json, logging, platform, re, sys, pathlib.Path), then third-party (defusedxml.ElementTree as ET), then local application imports (from openverifiablellm.environment import generate_environment_fingerprint and from openverifiablellm.manifest_chain import get_parent_manifest_hash), and remove any duplicate/unused imports; you can run isort or manually adjust the order and ensure typing imports (Union, Optional, Dict, Any, List, Tuple) are grouped with standard library imports.
401-405: ⚠️ Potential issue | 🟠 Major
Remove the second `extract_text_from_xml()` call.
The module entrypoint preprocesses the dump twice. That doubles the runtime, and the second pass rewrites `wiki_clean.txt` after the manifest has already been recorded, so a later failure can leave the manifest describing different on-disk state than the final output.
Suggested fix:

```diff
 extract_text_from_xml(
     sys.argv[1],
     write_manifest="--no-manifest" not in sys.argv[2:],
 )
-extract_text_from_xml(sys.argv[1])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 401 - 405, The file calls extract_text_from_xml twice (once with write_manifest set based on sys.argv and immediately again without arguments), causing duplicated work and a manifest mismatch; remove the second extract_text_from_xml(sys.argv[1]) invocation so extract_text_from_xml is only called once with the intended write_manifest argument (leave the existing extract_text_from_xml(sys.argv[1], write_manifest="--no-manifest" not in sys.argv[2:]) call intact).
267-269: ⚠️ Potential issue | 🟠 Major
Preserve the old `chunk_size` default in `export_merkle_proof()`.
This change turns a previously optional parameter into a required one, which is a backwards-incompatible runtime break for existing callers that relied on the default chunk size. Keep `chunk_size` defaulted and make the new `output_path` parameter keyword-only instead.
Suggested fix:

```diff
 def export_merkle_proof(
-    proof: List[Tuple[str, bool]], chunk_index: int, chunk_size: int, output_path: Union[str, Path]
+    proof: List[Tuple[str, bool]],
+    chunk_index: int,
+    chunk_size: int = MERKLE_CHUNK_SIZE_BYTES,
+    *,
+    output_path: Union[str, Path],
 ) -> None:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/utils.py` around lines 267 - 269, Restore the previous default for chunk_size in export_merkle_proof by making chunk_size optional again (e.g. chunk_size: int = DEFAULT_CHUNK_SIZE or = 1024 if no constant exists) and make output_path a keyword-only parameter by adding a '*' before it (signature: def export_merkle_proof(..., chunk_index: int, chunk_size: int = DEFAULT_CHUNK_SIZE, *, output_path: Union[str, Path]) -> None). Ensure DEFAULT_CHUNK_SIZE is defined or use the original numeric default and update any docstring/tests if they reference the former behavior.
openverifiablellm/verify.py (1)
20-39: ⚠️ Potential issue | 🟡 Minor
Normalize the import block here as well.
CI is failing on Ruff import ordering for this file, so the new imports need to be re-sorted before this can merge.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openverifiablellm/verify.py` around lines 20 - 39, Reorder the imports to satisfy Ruff: group standard-library imports first (sorted alphabetically), then third-party (none), then local package imports (sorted). In particular, sort the stdlib imports so dataclasses and enum come before json/logging/os, with pathlib and typing placed alphabetically (e.g., dataclasses, enum, json, logging, os, pathlib, platform, shutil, subprocess, sys, tempfile, typing). Then place local imports from openverifiablellm in alphabetical order by module and alphabetize imported names — e.g., import utils first, then from openverifiablellm.environment import generate_environment_fingerprint, and from openverifiablellm.manifest_chain import verify_manifest_chain, verify_manifest_chain_link (members sorted alphabetically).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@openverifiablellm/utils.py`:
- Around line 1-13: The import block in utils.py is not sorted per the project's
style (Ruff/isort), causing CI failures; reorder the imports into standard
library first (bz2, hashlib, json, logging, platform, re, sys, pathlib.Path),
then third-party (defusedxml.ElementTree as ET), then local application imports
(from openverifiablellm.environment import generate_environment_fingerprint and
from openverifiablellm.manifest_chain import get_parent_manifest_hash), and
remove any duplicate/unused imports; you can run isort or manually adjust the
order and ensure typing imports (Union, Optional, Dict, Any, List, Tuple) are
grouped with standard library imports.
- Around line 401-405: The file calls extract_text_from_xml twice (once with
write_manifest set based on sys.argv and immediately again without arguments),
causing duplicated work and a manifest mismatch; remove the second
extract_text_from_xml(sys.argv[1]) invocation so extract_text_from_xml is only
called once with the intended write_manifest argument (leave the existing
extract_text_from_xml(sys.argv[1], write_manifest="--no-manifest" not in
sys.argv[2:]) call intact).
- Around line 267-269: Restore the previous default for chunk_size in
export_merkle_proof by making chunk_size optional again (e.g. chunk_size: int =
DEFAULT_CHUNK_SIZE or = 1024 if no constant exists) and make output_path a
keyword-only parameter by adding a '*' before it (signature: def
export_merkle_proof(..., chunk_index: int, chunk_size: int = DEFAULT_CHUNK_SIZE,
*, output_path: Union[str, Path]) -> None). Ensure DEFAULT_CHUNK_SIZE is defined
or use the original numeric default and update any docstring/tests if they
reference the former behavior.
In `@openverifiablellm/verify.py`:
- Around line 20-39: Reorder the imports to satisfy Ruff: group standard-library
imports first (sorted alphabetically), then third-party (none), then local
package imports (sorted). In particular, sort the stdlib imports so dataclasses
and enum come before json/logging/os, with pathlib and typing placed
alphabetically (e.g., dataclasses, enum, json, logging, os, pathlib, platform,
shutil, subprocess, sys, tempfile, typing). Then place local imports from
openverifiablellm in alphabetical order by module and alphabetize imported names
— e.g., import utils first, then from openverifiablellm.environment import
generate_environment_fingerprint, and from openverifiablellm.manifest_chain
import verify_manifest_chain, verify_manifest_chain_link (members sorted
alphabetically).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b148a511-c1d7-46b7-bff5-43ad23dc2fc3
📒 Files selected for processing (2)
openverifiablellm/utils.py
openverifiablellm/verify.py
|
Hi @Archit381 I have resolved the conflicts and also have updated PR descrition instructions with UV support |
|
@aniket866 Fix the linting issues and conflicts.
I have fixed both.
@aniket866 Can we avoid inline Python code? Wrap it in a script.
Hi @Archit381, I have made the suggested changes and also updated the PR description with script commands.
Addressed Issues:
Closes #35
Step 1: Download dump
Windows / Linux / Mac
Step 2: First preprocessing
Windows / Linux / Mac
Step 3: Backup manifest
Windows (CMD / PowerShell)
Linux / Mac
Step 4: Second preprocessing (creates chain)
Windows / Linux / Mac
Step 5: Check manifest has parent hash
Windows / Linux / Mac
uv run python -c "import json; m = json.load(open('data/dataset_manifest.json')); print('parent_manifest_hash:', m.get('parent_manifest_hash')[:16] + '...')"

Step 6: Verify without chain (should PASS)
Windows / Linux / Mac
Step 7: Verify WITH chain (should PASS - chain is valid)
Windows
Linux / Mac
uv run python scripts/verify_dataset.py simplewiki-20260201-pages-articles.xml.bz2 \
  --previous-manifest data/manifest_v1.json

Step 8: Tamper the manifest
Windows / Linux / Mac
uv run python -c "import json; m = json.load(open('data/manifest_v1.json')); m['tampered'] = True; json.dump(m, open('data/manifest_v1.json', 'w'), indent=2); print(' Tampered')"

Step 9: Verify with tampered manifest (should FAIL - tampering detected!)
Windows
Linux / Mac
uv run python scripts/verify_dataset.py simplewiki-20260201-pages-articles.xml.bz2 \
  --previous-manifest data/manifest_v1.json

Step 10: Restore manifest
Windows / Linux / Mac
uv run python -c "import json; m = json.load(open('data/manifest_v1.json')); del m['tampered']; json.dump(m, open('data/manifest_v1.json', 'w'), indent=2); print(' Restored')"

Step 11: Verify restored (should PASS again)
Windows
Linux / Mac
uv run python scripts/verify_dataset.py simplewiki-20260201-pages-articles.xml.bz2 \
  --previous-manifest data/manifest_v1.json

If all steps above succeed, the feature has been successfully validated.
Screenshots/Recordings:
Additional Notes:
Before tampering: verification passes
After tampering: verification fails
Purpose
This PR introduces cryptographically linked dataset manifests to detect tampering in preprocessing outputs.
Each new manifest stores the SHA-256 hash of the previous manifest in the `parent_manifest_hash` field.
If any earlier manifest is modified, the stored hash will no longer match and verification will fail.
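This detection property can be demonstrated in a few lines. The field values are hypothetical and the hashing is simplified relative to the project's real canonicalization:

```python
import hashlib
import json


def manifest_hash(m: dict) -> str:
    # Simplified stand-in for the canonical-JSON SHA-256 used by the project.
    return hashlib.sha256(json.dumps(m, sort_keys=True).encode()).hexdigest()


v1 = {"dataset": "simplewiki", "parent_manifest_hash": ""}
v2 = {"dataset": "simplewiki", "parent_manifest_hash": manifest_hash(v1)}

assert manifest_hash(v1) == v2["parent_manifest_hash"]  # chain intact
v1["tampered"] = True                                   # edit the earlier manifest
assert manifest_hash(v1) != v2["parent_manifest_hash"]  # mismatch, so verification fails
```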
Checklist
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.
Summary by CodeRabbit
New Features
Tests