
[EXPERIMENT] Checkpoint reproducibility and entropy sources in PyTorch training #73

Open
ryoari wants to merge 7 commits into AOSSIE-Org:main from ryoari:experiment/checkpoint-reproducibility

Conversation


@ryoari ryoari commented Mar 14, 2026

Addressed Issues:

Fixes #62

Summary

This PR implements an experiment to validate whether identical training runs produce identical model checkpoints.

This question came up in a Discord discussion about whether checkpoint hashing could serve as a verification mechanism for deterministic training.

Before building a full verification pipeline, we need to confirm that training itself is reproducible.

Design Relevance

OpenVerifiableLLM relies on the assumption that a training run can be verified by reproducing it and comparing model checkpoints.

For this to work, two identical training runs must produce identical checkpoint bytes.

This experiment validates that assumption and identifies the entropy sources that must be controlled for deterministic training.

The results show that deterministic checkpoints require controlling:

  • PyTorch RNG state
  • deterministic algorithm settings
  • checkpoint serialization format

These findings can help inform the design of a deterministic training configuration for the project.
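As a rough illustration, such a configuration might look like the sketch below. The names `make_deterministic` and `checkpoint_hash` are illustrative only, not the PR's actual API; the sketch just combines the three controls listed above.

```python
# A rough sketch of a deterministic training configuration, assuming the
# three findings above. Names (make_deterministic, checkpoint_hash) are
# illustrative, not the PR's actual API.
import hashlib
import random

import numpy as np
import torch


def make_deterministic(seed: int = 0) -> None:
    # 1) Seed every RNG the training loop might consume.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # 2) Force deterministic kernel implementations.
    torch.use_deterministic_algorithms(True)


def checkpoint_hash(state_dict) -> str:
    # 3) Hash raw tensor bytes in a fixed key order, avoiding any entropy
    # the torch.save container format might introduce.
    digest = hashlib.sha256()
    for key in sorted(state_dict):
        digest.update(key.encode())
        digest.update(state_dict[key].detach().cpu().contiguous().numpy().tobytes())
    return digest.hexdigest()
```

Two runs that call `make_deterministic` with the same seed should then produce state dicts with identical `checkpoint_hash` values.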

What this experiment tests

The experiment investigates several potential sources of entropy in a simple PyTorch training loop.

Tested factors include:

• torch RNG seeding
• numpy / python RNG seeding
• dataset shuffle order
• dropout randomness
• checkpoint serialization format
• deterministic algorithm mode

Key findings

From the experiments:

• Torch RNG must be seeded – otherwise training is nondeterministic
• Deterministic algorithms must be enabled
• Checkpoint format can introduce entropy (`torch.save` vs deterministic save)
• numpy and Python RNG do not affect a pure PyTorch training loop
• Dropout and shuffle remain deterministic when the torch RNG is seeded
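The shuffle part of the last finding can be seen in a tiny sketch (CPU-only default DataLoader assumed; dataset and shapes are arbitrary examples, not the PR's code):

```python
# Illustrates the finding that shuffle order is deterministic once the
# torch RNG is seeded; dataset and shapes are arbitrary examples.
import torch
from torch.utils.data import DataLoader, TensorDataset


def first_batch(seed: int):
    torch.manual_seed(seed)  # DataLoader's shuffle draws from this RNG
    dataset = TensorDataset(torch.arange(32).float())
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    return next(iter(loader))[0].tolist()


# Same seed, same shuffle order on both calls.
assert first_batch(0) == first_batch(0)
```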

Experiment structure

The experiment consists of:

run_experiment.py
A broad sweep that toggles entropy sources and prints checkpoint hashes.

test_entropy_sources.py
Pytest suite validating determinism assumptions.

Documentation:

• WHY.md – motivation and reasoning
• RESULTS.md – experiment output and interpretation
• SOURCES.md – references

Why this matters for OpenVerifiableLLM

The project aims to allow independent verification of model training.

For checkpoint hashing to be a reliable verification mechanism, training must be reproducible under controlled conditions.

This experiment demonstrates:

• which entropy sources matter
• which ones do not
• what must be controlled in the training pipeline

These results can inform the design of a deterministic training configuration for the project.

How to run

cd experiments/checkpoint_reproducibility

python run_experiment.py
pytest test_entropy_sources.py -v

Screenshots/Recordings:

Output screenshots of `run_experiment.py` and `test_entropy_sources.py` (images not reproduced here).

Additional Notes:

AI Usage Disclosure

AI tools were used as a pair-programming and documentation aid, primarily for brainstorming potential entropy sources and improving documentation clarity.

All experiment design, implementation, debugging, and validation were performed locally by me.

Checklist

  • [x] My code follows the project's code style and conventions
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings or errors
  • [x] I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • [x] I have read the Contributing Guidelines

Summary by CodeRabbit

  • New Features

    • Added comprehensive checkpoint reproducibility experiment with documentation, test suite, and utilities for validating deterministic model training and serialization.
  • Dependencies

    • Updated minimum Python requirement to 3.10+
    • Added NumPy and PyTorch as project dependencies
  • Style

    • Minor formatting improvements for code consistency

@github-actions github-actions bot added backend, configuration, documentation, python, size/XL labels Mar 14, 2026

coderabbitai bot commented Mar 14, 2026

Walkthrough

This PR adds a checkpoint reproducibility experiment: documentation (README, WHY, RESULTS, SOURCES), a deterministic experiment runner and utilities (run_experiment.py, utils.py), a comprehensive pytest suite (test_entropy_sources.py), and updates project metadata to require Python ≥3.10 and add numpy and torch dependencies.

Changes

• Documentation — `experiments/checkpoint_reproducibility/README.md`, `WHY.md`, `RESULTS.md`, `SOURCES.md`: New docs describing experiment purpose, hypotheses, findings, run instructions, results with hash comparisons, bibliography, and platform notes (Windows DataLoader behavior).
• Experiment scripts & tests — `experiments/checkpoint_reproducibility/run_experiment.py`, `test_entropy_sources.py`: New deterministic experiment runner and a multi-part test suite (Q1–Q6) exploring RNG sources, save-format effects, data order, and edge cases; includes TinyModel, training harness, deterministic save, and pytest xfail annotations for expected nondeterminism.
• Utilities — `experiments/checkpoint_reproducibility/utils.py`: New helpers: `set_seed(seed)` to seed Python/NumPy/PyTorch and enable deterministic algorithms, and `hash_file(path)` to compute the SHA-256 of a file.
• Project config — `pyproject.toml`: Authors changed to "OpenVerifiableLLM Contributors", requires-python raised to >=3.10, and numpy and torch added as dependencies.
• Minor style — `openverifiablellm/verify.py`: Small formatting change in an else branch (no behavioral change).

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Experiment Runner
    participant Dataset as Dataset Factory
    participant Model as TinyModel
    participant Trainer as Training Loop
    participant Saver as Checkpoint Saver
    participant Hasher as SHA-256 Hasher

    Runner->>Dataset: make_dataset(base_seed)
    Dataset-->>Runner: TensorDataset (seeded)
    Runner->>Model: initialize TinyModel (base_seed)
    Model-->>Runner: model instance
    Runner->>Trainer: set seeds (torch/numpy/python)
    Trainer->>Model: train for N epochs (forward/backward)
    Trainer-->>Runner: trained model / state_dict
    Runner->>Saver: save checkpoint (deterministic or torch.save)
    alt deterministic save
        Saver->>Saver: write raw state_dict bytes
    else torch.save
        Saver->>Saver: write container (torch.save)
    end
    Saver-->>Runner: checkpoint file
    Runner->>Hasher: compute SHA-256
    Hasher-->>Runner: file hash
    Runner->>Runner: compare hashes vs baseline / other runs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Python Lang, Documentation

Suggested reviewers

  • Archit381

Poem

🐰 I hopped through seeds and bytes so small,

I guarded hashes, one and all.
Checkpoints lined up, neat and bright,
Deterministic dreams take flight—hop right! 🥕✨

🚥 Pre-merge checks: ✅ 2 passed

• Title check — ✅ Passed. The title clearly and specifically describes the main focus of the changeset: a new experiment to investigate checkpoint reproducibility and entropy sources in PyTorch training, which aligns with the PR's core objective of validating deterministic model checkpoints.
• Description check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.



@github-actions github-actions bot added size/XL and removed size/XL labels Mar 14, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@experiments/checkpoint_reproducibility/README.md`:
- Around line 51-56: The README.md file's final section "Windows note" is
missing a trailing newline which triggers markdownlint rule MD047; open the
README.md (the "Windows note" section) and add a single newline character at the
end of the file so the file ends with exactly one trailing newline.
- Around line 46-47: Update the README expected pytest result text that
currently reads "14 passed, 0 failed" (the line mentioning `run_experiment.py`
output) to match the actual pytest summary which includes xfail/xpass counts;
locate the "Expected: `run_experiment.py` shows..." sentence and replace the
simple "14 passed, 0 failed" wording with the real test summary that includes
xfailed/xpassed (or a generic phrasing like "tests pass with xfailed/xpassed
reported") so the README reflects the xfail/xpass cases.

In `@experiments/checkpoint_reproducibility/RESULTS.md`:
- Around line 79-85: The file ends without a final newline which violates
markdownlint MD047; update the file
"experiments/checkpoint_reproducibility/RESULTS.md" by adding a single trailing
newline character at the end of the file so the last line (the
"Multiprocessing..." bullet) is terminated with a newline and the file ends with
exactly one newline.
- Around line 66-75: The RESULTS.md section references old test names (e.g.,
test_torch_save_byte_stability, test_fixed_seed_isolates_save_format_entropy,
test_fixed_seed_plus_deterministic_save) which no longer match the actual test
functions in test_entropy_sources.py (they are now named test_q*); update the
text to use the current test function names from test_entropy_sources.py (or add
a short parenthetical mapping from the old names to the new test_q* names) so
readers can trace results to the real tests (search for test_q in
test_entropy_sources.py to find the exact identifiers to use).
- Around line 31-35: The pytest summary under the "## pytest" header currently
shows only "14 passed, 0 failed" which omits xfail/xpass outcomes; update the
fenced code block to reflect the full test result counts (e.g., passed, failed,
xfailed, xpassed, skipped) produced by the current test run so the document
matches reality—re-run pytest (or use the saved CI output) to get the exact
counts and replace the existing block text accordingly while preserving the
Markdown fenced code block format.

In `@experiments/checkpoint_reproducibility/run_experiment.py`:
- Around line 41-47: The save_deterministic serializer in run_experiment.py is
duplicated in test_entropy_sources.py; extract this logic into a single shared
helper (e.g., create a new function save_deterministic in a small module like
experiments.checkpoint_reproducibility.utils or experiments.utils) that
preserves the exact behavior (write k.encode() then
v.detach().cpu().contiguous().numpy().tobytes()), then replace the local
definitions in both run_experiment.py and test_entropy_sources.py with an import
of the shared save_deterministic function and remove the duplicate
implementations.
- Around line 105-166: The run() function currently builds a files list then
performs many train_and_save calls; if an exception occurs the final cleanup
loop may never run leaving artifacts—wrap the main experiment loop (everything
after files = [] up to before the final deletion) in a try/finally so the
deletion logic always executes; ensure files is declared before the try, move
the for f in files: if os.path.exists(f): os.remove(f) into the finally block,
and keep using the same path names created earlier (references: run, files,
train_and_save, print_result) so that any checkpoints are removed even on error.
- Around line 1-169: Ruff reports style issues in this module; run the
auto-formatter (ruff --fix) on this file to apply import grouping/ordering,
spacing, blank-line and line-length fixes, and then re-run CI; focus changes
around the top-level imports and the definitions of make_dataset,
save_deterministic, train_and_save, and run to ensure proper blank lines and
PEP8-compliant spacing so tests pass.

In `@experiments/checkpoint_reproducibility/test_entropy_sources.py`:
- Around line 136-140: The test currently writes and compares files using
hard-coded paths (torch.save(sd, "/tmp/_q2a.pt"), torch.save(sd,
"/tmp/_q2b.pt"), hash_file, os.remove), which is brittle on Windows/CI; change
the test to accept the pytest tmp_path fixture and create file paths under
tmp_path (e.g., tmp_path / "_q2a.pt" and tmp_path / "_q2b.pt"), use those paths
for torch.save and hash_file, and remove the explicit os.remove calls (or remove
via tmp_path cleanup) so the test becomes platform-independent; apply the same
replacement pattern wherever torch.save/hash_file/os.remove are called (notably
the other ranges mentioned).
- Around line 1-323: Run Ruff autoformat on this test file and apply fixes to
satisfy the linter: execute ruff --fix on
experiments/checkpoint_reproducibility/test_entropy_sources.py to correct import
ordering, spacing, line-length/indentation issues, and any unused-import or
formatting offenses; ensure the file compiles and tests still run (preserve
function/class names TinyModel, train, save_deterministic and all test_*
functions) and commit the updated file.
- Around line 19-20: The sys.path insertion adds the parent directory one level
too high so the import in this file fails; change the inserted path to the test
file's directory (the directory containing
experiments/checkpoint_reproducibility) so that utils.py can be found — update
the sys.path.insert call that currently uses
os.path.dirname(os.path.dirname(os.path.abspath(__file__))) to instead insert
the directory holding this test (e.g., use
os.path.dirname(os.path.abspath(__file__))) before the from utils import
hash_file import so the module resolves correctly.
- Around line 68-70: The train() function mutates global flags
torch.use_deterministic_algorithms, torch.backends.cudnn.deterministic, and
torch.backends.cudnn.benchmark without restoring them; wrap the section that
sets these flags in a try/finally: before changing, read and save the current
values of torch.use_deterministic_algorithms(),
torch.backends.cudnn.deterministic, and torch.backends.cudnn.benchmark into
local variables, then set the desired values, run the training logic, and in the
finally block restore the saved values back via
torch.use_deterministic_algorithms(saved), torch.backends.cudnn.deterministic =
saved, and torch.backends.cudnn.benchmark = saved to avoid leaking state to
other tests.

In `@experiments/checkpoint_reproducibility/utils.py`:
- Around line 1-19: Run Ruff autoformat on
experiments/checkpoint_reproducibility/utils.py and apply the suggested
reformatting changes: reorder and group imports (standard libs first: hashlib,
random; then third-party: numpy as np, torch), ensure a blank line between
imports and top-level functions, enforce two blank lines between the top-level
defs set_seed and hash_file, and fix any whitespace/line-ending issues so Ruff
passes; verify the functions set_seed and hash_file remain unchanged
semantically after formatting.
- Around line 17-19: hash_file currently reads the entire file into memory which
fails for large checkpoints; change it to stream the file in fixed-size chunks
(e.g., 64KB or 65536 bytes) and call hashlib.sha256().update(chunk) in a loop
until EOF, then return hasher.hexdigest(); keep the same function name hash_file
and signature in utils.py and ensure the file is opened in "rb" mode as it is
now.

In `@experiments/checkpoint_reproducibility/WHY.md`:
- Around line 68-73: The file ends without a trailing newline which triggers
markdownlint MD047; edit experiments/checkpoint_reproducibility/WHY.md and add a
single newline character at the end of the file (after the last bullet
"Distributed training: gradient averaging across workers introduces ordering
issues that are hard to control. Out of scope.") so the file ends with exactly
one trailing newline.
- Around line 18-19: Update the documentation references that mention outdated
test names (e.g., test_same_seed_same_weights and
test_fixed_seed_isolates_save_format_entropy) to the current test function names
used in the suite (look for the test_q* functions); find the mentions in WHY.md
and replace them with the exact current test identifiers (or add a parenthetical
mapping) so readers can directly correlate the narrative to the actual tests
(search for
test_same_seed_same_weights/test_fixed_seed_isolates_save_format_entropy and
substitute the correct test_q... names or add a short note mapping old→new).

In `@pyproject.toml`:
- Line 10: The Ruff configuration's target Python version currently set to py39
must be updated to match requires-python = ">=3.10"; locate the Ruff settings in
pyproject.toml (the ruff section and the target-version key) and change the
target from "py39" to "py310" (or the equivalent list including "py310") so
Ruff's linting/formatting targets Python 3.10+ consistently with
requires-python.
- Around line 16-17: Remove "numpy" and "torch" from the top-level dependencies
list in pyproject.toml and add them to an optional dependency group (for example
[project.optional-dependencies] under a group like "experiments" or "ml");
ensure the group contains the appropriate specifiers for numpy and torch so
users can opt into them (e.g., poetry/PEP 621 style optional deps). Reference
that numpy is only required by experiments/checkpoint_reproducibility and torch
is already safe in openverifiablellm/environment.py (keeps its try-except
fallback), so do not change code there—only move the two packages into the
optional-dependencies section and update any docs/README to mention how to
install extras (e.g., pip install .[experiments]).
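Several of the suggestions above (the `train()` flag leak, the `run()` cleanup) share one pattern: save global state, mutate it, and restore it in a `finally` block. A minimal sketch of that pattern, with `train_body()` as a placeholder rather than the PR's real training logic:

```python
# Sketch of the save/restore pattern for global determinism flags;
# train_body() is a placeholder for the real training logic.
import torch


def train(deterministic: bool = True):
    # Save the current global flag values before touching them.
    saved_mode = torch.are_deterministic_algorithms_enabled()
    saved_cudnn_det = torch.backends.cudnn.deterministic
    saved_cudnn_bench = torch.backends.cudnn.benchmark
    try:
        torch.use_deterministic_algorithms(deterministic)
        torch.backends.cudnn.deterministic = deterministic
        torch.backends.cudnn.benchmark = not deterministic
        ...  # train_body() would run here
    finally:
        # Restore globals so later tests see the state they started with.
        torch.use_deterministic_algorithms(saved_mode)
        torch.backends.cudnn.deterministic = saved_cudnn_det
        torch.backends.cudnn.benchmark = saved_cudnn_bench
```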

📥 Commits

Reviewing files that changed from the base of the PR and between c352df0 and f871f7d.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • _ckpt_Serialization
  • experiments/checkpoint_reproducibility/README.md
  • experiments/checkpoint_reproducibility/RESULTS.md
  • experiments/checkpoint_reproducibility/SOURCES.md
  • experiments/checkpoint_reproducibility/WHY.md
  • experiments/checkpoint_reproducibility/run_experiment.py
  • experiments/checkpoint_reproducibility/test_entropy_sources.py
  • experiments/checkpoint_reproducibility/utils.py
  • pyproject.toml

Comment on lines +46 to +47
Expected: `run_experiment.py` shows one genuinely non-deterministic result
(unseeded torch RNG). Pytest is 14 passed, 0 failed.

⚠️ Potential issue | 🟡 Minor

Refresh expected pytest result text.

The current test file includes xfail/xpass cases, so the expected output should not be documented as only 14 passed, 0 failed.


Comment on lines +51 to +56
## Windows note

`num_workers > 0` is skipped automatically on Windows. PyTorch uses `spawn` for
DataLoader workers which re-imports the script in each process. Making it work
needs more setup than the experiment is worth. See RESULTS.md for expected Linux
behaviour. No newline at end of file

⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis indicates this Markdown file should end with a single newline.


Comment on lines +31 to +35
## pytest

```
14 passed, 0 failed
```

⚠️ Potential issue | 🟡 Minor

Update the pytest summary to match the current suite outcomes.

The suite currently includes xfail/xpass scenarios; documenting only 14 passed, 0 failed is misleading for reproducibility interpretation.


Comment on lines +66 to +75
**test_torch_save_byte_stability**
Expected to fail (assert not match), but `torch.save` gave identical hashes
on this version. The test documents this outcome — see above.

**test_fixed_seed_isolates_save_format_entropy**
In-memory weights match (seed is fixed). File bytes also match on PyTorch 2.x.
The test prints the result either way — it's documentation, not a pass/fail gate.

**test_fixed_seed_plus_deterministic_save**
Passes. Same seed + `save_deterministic()` = identical SHA-256 end to end.

⚠️ Potential issue | 🟡 Minor

Use current test function names for traceability.

This section references test names that do not match the current test_q* function names in test_entropy_sources.py, which makes cross-checking harder.


Comment on lines +79 to +85
## Open questions

- Cross-machine: different hardware may produce different floats even with
identical seeds. Not tested.
- CUDA nondeterminism: needs a GPU to actually observe. The test is informational.
- Multiprocessing: skipped on Windows. Expected to work on Linux with an explicit
DataLoader generator seed. No newline at end of file

⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis reports this file should end with a single newline character.


Comment on lines +17 to +19
def hash_file(path):
with open(path, "rb") as f:
return hashlib.sha256(f.read()).hexdigest() No newline at end of file

🧹 Nitpick | 🔵 Trivial

Stream hashing instead of loading the whole file into RAM.

hash_file() currently reads the entire checkpoint at once, which scales poorly for larger files.

Suggested diff
-def hash_file(path):
-    with open(path, "rb") as f:
-        return hashlib.sha256(f.read()).hexdigest()
+def hash_file(path, chunk_size: int = 1024 * 1024):
+    digest = hashlib.sha256()
+    with open(path, "rb") as f:
+        for chunk in iter(lambda: f.read(chunk_size), b""):
+            digest.update(chunk)
+    return digest.hexdigest()

Comment on lines +18 to +19
It is — that's `test_same_seed_same_weights`, and it's the floor everything else
builds on.

⚠️ Potential issue | 🟡 Minor

Update references to actual test names in the current suite.

Examples like test_same_seed_same_weights and test_fixed_seed_isolates_save_format_entropy do not match current test_q* names, which makes the narrative harder to verify against code.

Also applies to: 45-46


Comment on lines +68 to +73
- **Cross-machine**: different CPU architectures can produce slightly different
floating-point results even with the same seed. Not tested here.
- **CUDA**: the nondeterministic mode test is informational only — the test
machine is CPU-only. On CUDA some operations are nondeterministic by default.
- **Distributed training**: gradient averaging across workers introduces ordering
issues that are hard to control. Out of scope. No newline at end of file

⚠️ Potential issue | 🟡 Minor

Add a trailing newline to satisfy markdownlint MD047.

Static analysis reports this file should end with a single newline character.


]
requires-python = ">=3.9"
authors = [{name = "OpenVerifiableLLM Contributors"}]
requires-python = ">=3.10"

⚠️ Potential issue | 🟡 Minor


Align Ruff's target Python version with requires-python.

requires-python is >=3.10, but Ruff is set to py39. This mismatch can cause incorrect lint/format expectations.

Suggested diff
 [tool.ruff]
 line-length = 100
-target-version = "py39"
+target-version = "py310"

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 14, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
experiments/checkpoint_reproducibility/utils.py (1)

17-19: 🧹 Nitpick | 🔵 Trivial

Stream hashing to avoid loading entire file into memory.

This was flagged in a previous review. The current implementation reads the entire checkpoint file into memory at once, which can be problematic for large model checkpoints.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/checkpoint_reproducibility/utils.py` around lines 17 - 19, The
hash_file function currently reads the entire file into memory; change it to
stream the file in fixed-size chunks to avoid high memory usage: create a
hashlib.sha256() hasher, open the file in "rb" and iteratively read and feed
chunks (e.g., 64KB) to hasher.update(...) until EOF, then return
hasher.hexdigest(); update the implementation of hash_file accordingly.
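The streaming approach described in the prompt can be sketched as follows (a minimal sketch, assuming the function keeps the existing name `hash_file` and a SHA-256 digest):

```python
import hashlib

def hash_file(path: str, chunk_size: int = 64 * 1024) -> str:
    """Hash a file by streaming it in 64 KB chunks.

    Large model checkpoints never have to fit in memory at once;
    the digest is identical to hashing the whole file in one read.
    """
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()
```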
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@experiments/checkpoint_reproducibility/run_experiment.py`:
- Line 13: Replace use of typing generics in run_experiment.py: remove `from
typing import Any, Dict, List, Tuple` and update any annotations that use
`List`, `Dict`, `Tuple` to use the built-in `list`, `dict`, `tuple` (and keep
`Any` if still needed or import `Any` only) — e.g., change
signatures/annotations referencing `List[...]`, `Dict[...]`, `Tuple[...]` to
`list[...]`, `dict[...]`, `tuple[...]` respectively; ensure all occurrences
(function params, returns, variables) are updated and unused typing imports are
removed.

In `@experiments/checkpoint_reproducibility/test_entropy_sources.py`:
- Around line 36-38: INPUTS and TARGETS are created as module-level mutable
tensors using torch.Generator().manual_seed(99), which can lead to test
pollution if any test mutates them; change this by producing fresh tensors per
test—either convert the module-level constants into a pytest fixture (e.g., a
fixture named inputs_targets that calls g = torch.Generator().manual_seed(99)
and returns torch.randn(...) pairs) or replace usages with a small helper
function (e.g., get_inputs_targets()) that creates and returns new tensors using
torch.Generator().manual_seed(99) so each test gets its own non-shared INPUTS
and TARGETS instead of mutating the module globals.
- Around line 141-143: The test uses "assert not match" to expect differing
hashes which yields an XPASS when torch.save becomes stable; change the logic to
assert that hashes match (assert match) and mark the non-matching case as an
expected failure so XPASS now indicates the library fixed the behavior —
specifically, update the assertion around the "match" variable and wrap the
non-matching branch with a pytest.xfail (or use `@pytest.mark.xfail`) so when
match is False the test is xfailed and when match is True the test passes,
referencing the existing "match" variable and the code path that uses
torch.save.

---

Duplicate comments:
In `@experiments/checkpoint_reproducibility/utils.py`:
- Around line 17-19: The hash_file function currently reads the entire file into
memory; change it to stream the file in fixed-size chunks to avoid high memory
usage: create a hashlib.sha256() hasher, open the file in "rb" and iteratively
read and feed chunks (e.g., 64KB) to hasher.update(...) until EOF, then return
hasher.hexdigest(); update the implementation of hash_file accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 665c2a8e-cf18-44c1-8336-efb71b6e9c95

📥 Commits

Reviewing files that changed from the base of the PR and between f871f7d and 5680a09.

📒 Files selected for processing (4)
  • experiments/checkpoint_reproducibility/run_experiment.py
  • experiments/checkpoint_reproducibility/test_entropy_sources.py
  • experiments/checkpoint_reproducibility/utils.py
  • openverifiablellm/verify.py

import os
import platform
import random
from typing import Any, Dict, List, Tuple

🧹 Nitpick | 🔵 Trivial

Prefer built-in generics over typing imports for Python 3.10+.

Since the project requires Python ≥3.10 (per pyproject.toml), you can use built-in list, tuple, and dict directly in type hints instead of importing from typing.

♻️ Proposed fix
-from typing import Any, Dict, List, Tuple
+from typing import Any

Then update the type hints:

-    broken: List[Tuple[str, Dict[str, Any]]] = [
+    broken: list[tuple[str, dict[str, Any]]] = [
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/checkpoint_reproducibility/run_experiment.py` at line 13, Replace
use of typing generics in run_experiment.py: remove `from typing import Any,
Dict, List, Tuple` and update any annotations that use `List`, `Dict`, `Tuple`
to use the built-in `list`, `dict`, `tuple` (and keep `Any` if still needed or
import `Any` only) — e.g., change signatures/annotations referencing
`List[...]`, `Dict[...]`, `Tuple[...]` to `list[...]`, `dict[...]`, `tuple[...]`
respectively; ensure all occurrences (function params, returns, variables) are
updated and unused typing imports are removed.

Comment on lines +36 to +38
g = torch.Generator().manual_seed(99)
INPUTS = torch.randn(5, 10, generator=g)
TARGETS = torch.randn(5, 2, generator=g)

🧹 Nitpick | 🔵 Trivial

Module-level mutable globals may cause test pollution.

INPUTS and TARGETS are generated once at module load time. While the generator is seeded, if any test mutates these tensors in-place, subsequent tests would see corrupted data. Consider using a pytest fixture or generating fresh tensors per test if mutation is possible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/checkpoint_reproducibility/test_entropy_sources.py` around lines
36 - 38, INPUTS and TARGETS are created as module-level mutable tensors using
torch.Generator().manual_seed(99), which can lead to test pollution if any test
mutates them; change this by producing fresh tensors per test—either convert the
module-level constants into a pytest fixture (e.g., a fixture named
inputs_targets that calls g = torch.Generator().manual_seed(99) and returns
torch.randn(...) pairs) or replace usages with a small helper function (e.g.,
get_inputs_targets()) that creates and returns new tensors using
torch.Generator().manual_seed(99) so each test gets its own non-shared INPUTS
and TARGETS instead of mutating the module globals.
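The helper-function variant suggested above could look like this (a sketch; `get_inputs_targets` is the hypothetical name from the prompt, with the shapes and seed taken from the original snippet):

```python
import torch

def get_inputs_targets() -> tuple[torch.Tensor, torch.Tensor]:
    """Return fresh (inputs, targets) tensors on every call.

    A new seeded generator per call means each test sees identical
    values but owns its own tensor objects, so in-place mutation in
    one test cannot leak into another.
    """
    g = torch.Generator().manual_seed(99)
    return torch.randn(5, 10, generator=g), torch.randn(5, 2, generator=g)
```

Wrapping the same body in a `@pytest.fixture` gives the fixture variant the prompt also mentions.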

Comment on lines +141 to +143
# Expecting this to fail (hashes differ due to ZIP metadata).
# If it passes (XPASS), torch has fixed this internally - document it.
assert not match

⚠️ Potential issue | 🟡 Minor

Inverted assertion may cause confusing XPASS behavior.

The test asserts not match expecting hashes to differ, but when torch.save is stable (PyTorch 2.x), this causes an XPASS with a failing assertion. Consider restructuring: assert that hashes match and mark the test as xfail when they don't, so XPASS indicates the expected "fixed" behavior.

🛠️ Proposed fix
-    # Expecting this to fail (hashes differ due to ZIP metadata).
-    # If it passes (XPASS), torch has fixed this internally - document it.
-    assert not match
+    # On older PyTorch: hashes differ due to ZIP metadata (test passes as xfail).
+    # On PyTorch 2.x: hashes match, test unexpectedly passes (XPASS).
+    assert match
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experiments/checkpoint_reproducibility/test_entropy_sources.py` around lines
141 - 143, The test uses "assert not match" to expect differing hashes which
yields an XPASS when torch.save becomes stable; change the logic to assert that
hashes match (assert match) and mark the non-matching case as an expected
failure so XPASS now indicates the library fixed the behavior — specifically,
update the assertion around the "match" variable and wrap the non-matching
branch with a pytest.xfail (or use `@pytest.mark.xfail`) so when match is False
the test is xfailed and when match is True the test passes, referencing the
existing "match" variable and the code path that uses torch.save.
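The restructured test might look like the sketch below; `save_and_hash` is a hypothetical stand-in for the test's actual torch.save-then-hash logic, replaced here with a deterministic stub so the control flow is self-contained:

```python
import hashlib
import io
import pytest

def save_and_hash(state: dict) -> str:
    # Hypothetical stand-in: the real test would torch.save(state, buf)
    # and hash the resulting bytes.
    buf = io.BytesIO()
    buf.write(repr(sorted(state.items())).encode())
    return hashlib.sha256(buf.getvalue()).hexdigest()

def test_torch_save_is_byte_stable():
    state = {"weight": [0.1, 0.2]}
    match = save_and_hash(state) == save_and_hash(state)
    if not match:
        # Older PyTorch: ZIP archive metadata makes repeated saves differ,
        # so non-matching hashes are an expected failure, not a test bug.
        pytest.xfail("torch.save bytes differ due to ZIP metadata")
    # Byte-stable torch.save: the test passes outright instead of
    # XPASSing against an inverted assertion.
    assert match
```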
