[EXPERIMENT] Checkpoint reproducibility and entropy sources in PyTorch training#73
ryoari wants to merge 7 commits into AOSSIE-Org:main from
Conversation
… Update project metadata to use contributor-based authorship
Walkthrough
This PR adds a checkpoint reproducibility experiment: documentation (README, WHY, RESULTS, SOURCES), a deterministic experiment runner, and utilities.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Runner as Experiment Runner
    participant Dataset as Dataset Factory
    participant Model as TinyModel
    participant Trainer as Training Loop
    participant Saver as Checkpoint Saver
    participant Hasher as SHA-256 Hasher
    Runner->>Dataset: make_dataset(base_seed)
    Dataset-->>Runner: TensorDataset (seeded)
    Runner->>Model: initialize TinyModel (base_seed)
    Model-->>Runner: model instance
    Runner->>Trainer: set seeds (torch/numpy/python)
    Trainer->>Model: train for N epochs (forward/backward)
    Trainer-->>Runner: trained model / state_dict
    Runner->>Saver: save checkpoint (deterministic or torch.save)
    alt deterministic save
        Saver->>Saver: write raw state_dict bytes
    else torch.save
        Saver->>Saver: write container (torch.save)
    end
    Saver-->>Runner: checkpoint file
    Runner->>Hasher: compute SHA-256
    Hasher-->>Runner: file hash
    Runner->>Runner: compare hashes vs baseline / other runs
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 18
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@experiments/checkpoint_reproducibility/README.md`:
- Around line 51-56: The README.md file's final section "Windows note" is
missing a trailing newline which triggers markdownlint rule MD047; open the
README.md (the "Windows note" section) and add a single newline character at the
end of the file so the file ends with exactly one trailing newline.
- Around line 46-47: Update the README expected pytest result text that
currently reads "14 passed, 0 failed" (the line mentioning `run_experiment.py`
output) to match the actual pytest summary which includes xfail/xpass counts;
locate the "Expected: `run_experiment.py` shows..." sentence and replace the
simple "14 passed, 0 failed" wording with the real test summary that includes
xfailed/xpassed (or a generic phrasing like "tests pass with xfailed/xpassed
reported") so the README reflects the xfail/xpass cases.
In `@experiments/checkpoint_reproducibility/RESULTS.md`:
- Around line 79-85: The file ends without a final newline which violates
markdownlint MD047; update the file
"experiments/checkpoint_reproducibility/RESULTS.md" by adding a single trailing
newline character at the end of the file so the last line (the
"Multiprocessing..." bullet) is terminated with a newline and the file ends with
exactly one newline.
- Around line 66-75: The RESULTS.md section references old test names (e.g.,
test_torch_save_byte_stability, test_fixed_seed_isolates_save_format_entropy,
test_fixed_seed_plus_deterministic_save) which no longer match the actual test
functions in test_entropy_sources.py (they are now named test_q*); update the
text to use the current test function names from test_entropy_sources.py (or add
a short parenthetical mapping from the old names to the new test_q* names) so
readers can trace results to the real tests (search for test_q in
test_entropy_sources.py to find the exact identifiers to use).
- Around line 31-35: The pytest summary under the "## pytest" header currently
shows only "14 passed, 0 failed" which omits xfail/xpass outcomes; update the
fenced code block to reflect the full test result counts (e.g., passed, failed,
xfailed, xpassed, skipped) produced by the current test run so the document
matches reality—re-run pytest (or use the saved CI output) to get the exact
counts and replace the existing block text accordingly while preserving the
Markdown fenced code block format.
In `@experiments/checkpoint_reproducibility/run_experiment.py`:
- Around line 41-47: The save_deterministic serializer in run_experiment.py is
duplicated in test_entropy_sources.py; extract this logic into a single shared
helper (e.g., create a new function save_deterministic in a small module like
experiments.checkpoint_reproducibility.utils or experiments.utils) that
preserves the exact behavior (write k.encode() then
v.detach().cpu().contiguous().numpy().tobytes()), then replace the local
definitions in both run_experiment.py and test_entropy_sources.py with an import
of the shared save_deterministic function and remove the duplicate
implementations.
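The shared helper the comment describes can be sketched torch-free; `struct`-packed float lists stand in for the real tensor bytes (`v.detach().cpu().contiguous().numpy().tobytes()`), and the function names mirror the existing ones:

```python
import hashlib
import struct


def save_deterministic(state, path):
    # Torch-free sketch of the shared helper: write each key's bytes followed
    # by its value's raw bytes, with no container metadata that could vary
    # between runs. The real version writes k.encode() then the tensor's
    # contiguous numpy bytes for each entry.
    with open(path, "wb") as f:
        for k, v in state.items():
            f.write(k.encode())
            f.write(struct.pack(f"<{len(v)}d", *v))


def hash_file(path):
    # Streamed SHA-256, matching the utils.py suggestion elsewhere in this review.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With this shape, identical state dicts always produce identical bytes, so their hashes match run to run.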
- Around line 105-166: The run() function currently builds a files list then
performs many train_and_save calls; if an exception occurs the final cleanup
loop may never run leaving artifacts—wrap the main experiment loop (everything
after files = [] up to before the final deletion) in a try/finally so the
deletion logic always executes; ensure files is declared before the try, move
the for f in files: if os.path.exists(f): os.remove(f) into the finally block,
and keep using the same path names created earlier (references: run, files,
train_and_save, print_result) so that any checkpoints are removed even on error.
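The try/finally shape this comment asks for can be sketched independently of the training code; `steps` below is a hypothetical stand-in for the sequence of `train_and_save` calls, each returning the checkpoint path it wrote:

```python
import os


def run_with_cleanup(steps):
    # `steps` stands in for the train_and_save calls in run(); each callable
    # returns the path of a checkpoint it created. Declaring `files` before
    # the try and deleting inside finally guarantees artifacts are removed
    # even when a step raises.
    files = []
    try:
        for step in steps:
            files.append(step())
    finally:
        for f in files:
            if os.path.exists(f):
                os.remove(f)
```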
- Around line 1-169: Ruff reports style issues in this module; run the
auto-formatter (ruff --fix) on this file to apply import grouping/ordering,
spacing, blank-line and line-length fixes, and then re-run CI; focus changes
around the top-level imports and the definitions of make_dataset,
save_deterministic, train_and_save, and run to ensure proper blank lines and
PEP8-compliant spacing so tests pass.
In `@experiments/checkpoint_reproducibility/test_entropy_sources.py`:
- Around line 136-140: The test currently writes and compares files using
hard-coded paths (torch.save(sd, "/tmp/_q2a.pt"), torch.save(sd,
"/tmp/_q2b.pt"), hash_file, os.remove), which is brittle on Windows/CI; change
the test to accept the pytest tmp_path fixture and create file paths under
tmp_path (e.g., tmp_path / "_q2a.pt" and tmp_path / "_q2b.pt"), use those paths
for torch.save and hash_file, and remove the explicit os.remove calls (or remove
via tmp_path cleanup) so the test becomes platform-independent; apply the same
replacement pattern wherever torch.save/hash_file/os.remove are called (notably
the other ranges mentioned).
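Assuming a plain `pickle` payload as a stand-in for `torch.save` (so the sketch stays torch-free), the `tmp_path` rewrite could look like this; the file names mirror the review's examples and the test name is illustrative:

```python
import hashlib
import pickle


def hash_file(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def test_q2_bytes_stable(tmp_path):
    # tmp_path is pytest's built-in per-test temporary directory fixture;
    # pickle.dumps stands in for torch.save in this torch-free sketch.
    sd = {"w": [1.0, 2.0], "b": [0.5]}
    a = tmp_path / "_q2a.pt"
    b = tmp_path / "_q2b.pt"
    a.write_bytes(pickle.dumps(sd))
    b.write_bytes(pickle.dumps(sd))
    # No os.remove needed: pytest deletes tmp_path after the test.
    assert hash_file(a) == hash_file(b)
```

Under pytest the fixture is injected automatically, so no hard-coded `/tmp` paths remain and the test runs unchanged on Windows and in CI.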
- Around line 1-323: Run Ruff autoformat on this test file and apply fixes to
satisfy the linter: execute ruff --fix on
experiments/checkpoint_reproducibility/test_entropy_sources.py to correct import
ordering, spacing, line-length/indentation issues, and any unused-import or
formatting offenses; ensure the file compiles and tests still run (preserve
function/class names TinyModel, train, save_deterministic and all test_*
functions) and commit the updated file.
- Around line 19-20: The sys.path insertion adds the parent directory one level
too high so the import in this file fails; change the inserted path to the test
file's directory (the directory containing
experiments/checkpoint_reproducibility) so that utils.py can be found — update
the sys.path.insert call that currently uses
os.path.dirname(os.path.dirname(os.path.abspath(__file__))) to instead insert
the directory holding this test (e.g., use
os.path.dirname(os.path.abspath(__file__))) before the from utils import
hash_file import so the module resolves correctly.
- Around line 68-70: The train() function mutates global flags
torch.use_deterministic_algorithms, torch.backends.cudnn.deterministic, and
torch.backends.cudnn.benchmark without restoring them; wrap the section that
sets these flags in a try/finally: before changing, read and save the current
values of torch.use_deterministic_algorithms(),
torch.backends.cudnn.deterministic, and torch.backends.cudnn.benchmark into
local variables, then set the desired values, run the training logic, and in the
finally block restore the saved values back via
torch.use_deterministic_algorithms(saved), torch.backends.cudnn.deterministic =
saved, and torch.backends.cudnn.benchmark = saved to avoid leaking state to
other tests.
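A minimal sketch of the save/set/restore pattern, using a plain dict to stand in for the three torch globals (note that in real code the getter for the first flag is `torch.are_deterministic_algorithms_enabled()`, not a call to the setter):

```python
from contextlib import contextmanager

# Stand-in for the three torch globals named above; in real code these are
# torch.are_deterministic_algorithms_enabled(),
# torch.backends.cudnn.deterministic, and torch.backends.cudnn.benchmark.
flags = {"det_algos": False, "cudnn_det": False, "cudnn_bench": True}


@contextmanager
def deterministic_mode():
    # Save current values, switch to deterministic settings, and restore in
    # finally so no state leaks into other tests even if training raises.
    saved = dict(flags)
    flags.update(det_algos=True, cudnn_det=True, cudnn_bench=False)
    try:
        yield
    finally:
        flags.update(saved)
```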
In `@experiments/checkpoint_reproducibility/utils.py`:
- Around line 1-19: Run Ruff autoformat on
experiments/checkpoint_reproducibility/utils.py and apply the suggested
reformatting changes: reorder and group imports (standard libs first: hashlib,
random; then third-party: numpy as np, torch), ensure a blank line between
imports and top-level functions, enforce two blank lines between the top-level
defs set_seed and hash_file, and fix any whitespace/line-ending issues so Ruff
passes; verify the functions set_seed and hash_file remain unchanged
semantically after formatting.
- Around line 17-19: hash_file currently reads the entire file into memory which
fails for large checkpoints; change it to stream the file in fixed-size chunks
(e.g., 64KB or 65536 bytes) and call hashlib.sha256().update(chunk) in a loop
until EOF, then return hasher.hexdigest(); keep the same function name hash_file
and signature in utils.py and ensure the file is opened in "rb" mode as it is
now.
In `@experiments/checkpoint_reproducibility/WHY.md`:
- Around line 68-73: The file ends without a trailing newline which triggers
markdownlint MD047; edit experiments/checkpoint_reproducibility/WHY.md and add a
single newline character at the end of the file (after the last bullet
"Distributed training: gradient averaging across workers introduces ordering
issues that are hard to control. Out of scope.") so the file ends with exactly
one trailing newline.
- Around line 18-19: Update the documentation references that mention outdated
test names (e.g., test_same_seed_same_weights and
test_fixed_seed_isolates_save_format_entropy) to the current test function names
used in the suite (look for the test_q* functions); find the mentions in WHY.md
and replace them with the exact current test identifiers (or add a parenthetical
mapping) so readers can directly correlate the narrative to the actual tests
(search for
test_same_seed_same_weights/test_fixed_seed_isolates_save_format_entropy and
substitute the correct test_q... names or add a short note mapping old→new).
In `@pyproject.toml`:
- Line 10: The Ruff configuration's target Python version currently set to py39
must be updated to match requires-python = ">=3.10"; locate the Ruff settings in
pyproject.toml (the ruff section and the target-version key) and change the
target from "py39" to "py310" (or the equivalent list including "py310") so
Ruff's linting/formatting targets Python 3.10+ consistently with
requires-python.
- Around line 16-17: Remove "numpy" and "torch" from the top-level dependencies
list in pyproject.toml and add them to an optional dependency group (for example
[project.optional-dependencies] under a group like "experiments" or "ml");
ensure the group contains the appropriate specifiers for numpy and torch so
users can opt into them (e.g., poetry/PEP 621 style optional deps). Reference
that numpy is only required by experiments/checkpoint_reproducibility and torch
is already safe in openverifiablellm/environment.py (keeps its try-except
fallback), so do not change code there—only move the two packages into the
optional-dependencies section and update any docs/README to mention how to
install extras (e.g., pip install .[experiments]).
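Under those suggestions, the pyproject change might look like the sketch below; the group name `experiments` comes from the reviewer's example, and the bare specifiers are placeholders rather than pinned versions:

```toml
# numpy and torch move out of [project].dependencies into an extras group:
[project.optional-dependencies]
experiments = ["numpy", "torch"]
```

Users would then opt in with `pip install .[experiments]`, as the comment notes.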
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0e4a040c-39b0-4b08-82a8-3c13b92d4cec
⛔ Files ignored due to path filters (1)
- uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
- _ckpt_Serialization
- experiments/checkpoint_reproducibility/README.md
- experiments/checkpoint_reproducibility/RESULTS.md
- experiments/checkpoint_reproducibility/SOURCES.md
- experiments/checkpoint_reproducibility/WHY.md
- experiments/checkpoint_reproducibility/run_experiment.py
- experiments/checkpoint_reproducibility/test_entropy_sources.py
- experiments/checkpoint_reproducibility/utils.py
- pyproject.toml
> Expected: `run_experiment.py` shows one genuinely non-deterministic result
> (unseeded torch RNG). Pytest is 14 passed, 0 failed.
Refresh expected pytest result text.
The current test file includes xfail/xpass cases, so the expected output should not be documented as only 14 passed, 0 failed.
> ## Windows note
>
> `num_workers > 0` is skipped automatically on Windows. PyTorch uses `spawn` for
> DataLoader workers which re-imports the script in each process. Making it work
> needs more setup than the experiment is worth. See RESULTS.md for expected Linux
> behaviour. (No newline at end of file)
Add a trailing newline to satisfy markdownlint MD047.
Static analysis indicates this Markdown file should end with a single newline.
> ## pytest
>
> ```
> 14 passed, 0 failed
> ```
Update the pytest summary to match the current suite outcomes.
The suite currently includes xfail/xpass scenarios; documenting only 14 passed, 0 failed is misleading for reproducibility interpretation.
> **test_torch_save_byte_stability**
> Expected to fail (assert not match), but `torch.save` gave identical hashes
> on this version. The test documents this outcome — see above.
>
> **test_fixed_seed_isolates_save_format_entropy**
> In-memory weights match (seed is fixed). File bytes also match on PyTorch 2.x.
> The test prints the result either way — it's documentation, not a pass/fail gate.
>
> **test_fixed_seed_plus_deterministic_save**
> Passes. Same seed + `save_deterministic()` = identical SHA-256 end to end.
Use current test function names for traceability.
This section references test names that do not match the current test_q* function names in test_entropy_sources.py, which makes cross-checking harder.
> ## Open questions
>
> - Cross-machine: different hardware may produce different floats even with
>   identical seeds. Not tested.
> - CUDA nondeterminism: needs a GPU to actually observe. The test is informational.
> - Multiprocessing: skipped on Windows. Expected to work on Linux with an explicit
>   DataLoader generator seed. (No newline at end of file)
Add a trailing newline to satisfy markdownlint MD047.
Static analysis reports this file should end with a single newline character.
> ```python
> def hash_file(path):
>     with open(path, "rb") as f:
>         return hashlib.sha256(f.read()).hexdigest()
> ```
> (No newline at end of file)
🧹 Nitpick | 🔵 Trivial
Stream hashing instead of loading the whole file into RAM.
hash_file() currently reads the entire checkpoint at once, which scales poorly for larger files.
Suggested diff
```diff
-def hash_file(path):
-    with open(path, "rb") as f:
-        return hashlib.sha256(f.read()).hexdigest()
+def hash_file(path, chunk_size: int = 1024 * 1024):
+    digest = hashlib.sha256()
+    with open(path, "rb") as f:
+        for chunk in iter(lambda: f.read(chunk_size), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
```
> It is — that's `test_same_seed_same_weights`, and it's the floor everything else
> builds on.
Update references to actual test names in the current suite.
Examples like test_same_seed_same_weights and test_fixed_seed_isolates_save_format_entropy do not match current test_q* names, which makes the narrative harder to verify against code.
Also applies to: 45-46
> - **Cross-machine**: different CPU architectures can produce slightly different
>   floating-point results even with the same seed. Not tested here.
> - **CUDA**: the nondeterministic mode test is informational only — the test
>   machine is CPU-only. On CUDA some operations are nondeterministic by default.
> - **Distributed training**: gradient averaging across workers introduces ordering
>   issues that are hard to control. Out of scope. (No newline at end of file)
Add a trailing newline to satisfy markdownlint MD047.
Static analysis reports this file should end with a single newline character.
🧰 Tools
🪛 LanguageTool
[style] ~73-~73: To elevate your writing, try using a synonym here.
Context: ...s introduces ordering issues that are hard to control. Out of scope.
(HARD_TO)
🪛 markdownlint-cli2 (0.21.0)
[warning] 73-73: Files should end with a single newline character
(MD047, single-trailing-newline)
> ```toml
> ]
> requires-python = ">=3.9"
> authors = [{name = "OpenVerifiableLLM Contributors"}]
> requires-python = ">=3.10"
> ```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n pyproject.toml
```
Repository: AOSSIE-Org/OpenVerifiableLLM
Length of output: 1241
Align Ruff's target Python version with requires-python.
requires-python is >=3.10, but Ruff is set to py39. This mismatch can cause incorrect lint/format expectations.
Suggested diff
```diff
 [tool.ruff]
 line-length = 100
-target-version = "py39"
+target-version = "py310"
```
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 10, The Ruff configuration's target Python version
currently set to py39 must be updated to match requires-python = ">=3.10";
locate the Ruff settings in pyproject.toml (the ruff section and the
target-version key) and change the target from "py39" to "py310" (or the
equivalent list including "py310") so Ruff's linting/formatting targets Python
3.10+ consistently with requires-python.
Actionable comments posted: 3
♻️ Duplicate comments (1)
experiments/checkpoint_reproducibility/utils.py (1)
17-19: 🧹 Nitpick | 🔵 Trivial
Stream hashing to avoid loading the entire file into memory.
This was flagged in a previous review. The current implementation reads the entire checkpoint file into memory at once, which can be problematic for large model checkpoints.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@experiments/checkpoint_reproducibility/utils.py` around lines 17 - 19, The hash_file function currently reads the entire file into memory; change it to stream the file in fixed-size chunks to avoid high memory usage: create a hashlib.sha256() hasher, open the file in "rb" and iteratively read and feed chunks (e.g., 64KB) to hasher.update(...) until EOF, then return hasher.hexdigest(); update the implementation of hash_file accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@experiments/checkpoint_reproducibility/run_experiment.py`:
- Line 13: Replace use of typing generics in run_experiment.py: remove `from
typing import Any, Dict, List, Tuple` and update any annotations that use
`List`, `Dict`, `Tuple` to use the built-in `list`, `dict`, `tuple` (and keep
`Any` if still needed or import `Any` only) — e.g., change
signatures/annotations referencing `List[...]`, `Dict[...]`, `Tuple[...]` to
`list[...]`, `dict[...]`, `tuple[...]` respectively; ensure all occurrences
(function params, returns, variables) are updated and unused typing imports are
removed.
In `@experiments/checkpoint_reproducibility/test_entropy_sources.py`:
- Around line 36-38: INPUTS and TARGETS are created as module-level mutable
tensors using torch.Generator().manual_seed(99), which can lead to test
pollution if any test mutates them; change this by producing fresh tensors per
test—either convert the module-level constants into a pytest fixture (e.g., a
fixture named inputs_targets that calls g = torch.Generator().manual_seed(99)
and returns torch.randn(...) pairs) or replace usages with a small helper
function (e.g., get_inputs_targets()) that creates and returns new tensors using
torch.Generator().manual_seed(99) so each test gets its own non-shared INPUTS
and TARGETS instead of mutating the module globals.
- Around line 141-143: The test uses "assert not match" to expect differing
hashes which yields an XPASS when torch.save becomes stable; change the logic to
assert that hashes match (assert match) and mark the non-matching case as an
expected failure so XPASS now indicates the library fixed the behavior —
specifically, update the assertion around the "match" variable and wrap the
non-matching branch with a pytest.xfail (or use `@pytest.mark.xfail`) so when
match is False the test is xfailed and when match is True the test passes,
referencing the existing "match" variable and the code path that uses
torch.save.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 665c2a8e-cf18-44c1-8336-efb71b6e9c95
📒 Files selected for processing (4)
- experiments/checkpoint_reproducibility/run_experiment.py
- experiments/checkpoint_reproducibility/test_entropy_sources.py
- experiments/checkpoint_reproducibility/utils.py
- openverifiablellm/verify.py
> ```python
> import os
> import platform
> import random
> from typing import Any, Dict, List, Tuple
> ```
🧹 Nitpick | 🔵 Trivial
Prefer built-in generics over typing imports for Python 3.10+.
Since the project requires Python ≥3.10 (per pyproject.toml), you can use built-in list, tuple, and dict directly in type hints instead of importing from typing.
♻️ Proposed fix
```diff
-from typing import Any, Dict, List, Tuple
+from typing import Any
```
Then update the type hints:
```diff
-    broken: List[Tuple[str, Dict[str, Any]]] = [
+    broken: list[tuple[str, dict[str, Any]]] = [
```
+ broken: list[tuple[str, dict[str, Any]]] = [🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/checkpoint_reproducibility/run_experiment.py` at line 13, Replace
use of typing generics in run_experiment.py: remove `from typing import Any,
Dict, List, Tuple` and update any annotations that use `List`, `Dict`, `Tuple`
to use the built-in `list`, `dict`, `tuple` (and keep `Any` if still needed or
import `Any` only) — e.g., change signatures/annotations referencing
`List[...]`, `Dict[...]`, `Tuple[...]` to `list[...]`, `dict[...]`, `tuple[...]`
respectively; ensure all occurrences (function params, returns, variables) are
updated and unused typing imports are removed.
> ```python
> g = torch.Generator().manual_seed(99)
> INPUTS = torch.randn(5, 10, generator=g)
> TARGETS = torch.randn(5, 2, generator=g)
> ```
🧹 Nitpick | 🔵 Trivial
Module-level mutable globals may cause test pollution.
INPUTS and TARGETS are generated once at module load time. While the generator is seeded, if any test mutates these tensors in-place, subsequent tests would see corrupted data. Consider using a pytest fixture or generating fresh tensors per test if mutation is possible.
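The helper-function variant suggested here can be sketched torch-free; `random.Random` stands in for `torch.Generator().manual_seed(99)`, and the name `get_inputs_targets` follows the reviewer's example:

```python
import random


def get_inputs_targets():
    # Fresh, identically-seeded data on every call: no shared module-level
    # tensors for a test to mutate. In the real test this would use
    # g = torch.Generator().manual_seed(99) and torch.randn(..., generator=g).
    rng = random.Random(99)
    inputs = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(5)]
    targets = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(5)]
    return inputs, targets
```

Each test then calls the helper itself, so in-place mutation in one test can never leak into another.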
> ```python
> # Expecting this to fail (hashes differ due to ZIP metadata).
> # If it passes (XPASS), torch has fixed this internally - document it.
> assert not match
> ```
Inverted assertion may cause confusing XPASS behavior.
The test asserts not match expecting hashes to differ, but when torch.save is stable (PyTorch 2.x), this causes an XPASS with a failing assertion. Consider restructuring: assert that hashes match and mark the test as xfail when they don't, so XPASS indicates the expected "fixed" behavior.
🛠️ Proposed fix
```diff
-    # Expecting this to fail (hashes differ due to ZIP metadata).
-    # If it passes (XPASS), torch has fixed this internally - document it.
-    assert not match
+    # On older PyTorch: hashes differ due to ZIP metadata (test passes as xfail).
+    # On PyTorch 2.x: hashes match, test unexpectedly passes (XPASS).
+    assert match
```
+ assert match🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experiments/checkpoint_reproducibility/test_entropy_sources.py` around lines
141 - 143, The test uses "assert not match" to expect differing hashes which
yields an XPASS when torch.save becomes stable; change the logic to assert that
hashes match (assert match) and mark the non-matching case as an expected
failure so XPASS now indicates the library fixed the behavior — specifically,
update the assertion around the "match" variable and wrap the non-matching
branch with a pytest.xfail (or use `@pytest.mark.xfail`) so when match is False
the test is xfailed and when match is True the test passes, referencing the
existing "match" variable and the code path that uses torch.save.
Addressed Issues: Fixes #62
Summary
This PR implements an experiment to validate whether identical training runs produce identical model checkpoints.
This question came up in a Discord discussion about whether checkpoint hashing could serve as a verification mechanism for deterministic training.
Before building a full verification pipeline, we need to confirm that training itself is reproducible.
Design Relevance
OpenVerifiableLLM relies on the assumption that a training run can be verified by reproducing it and comparing model checkpoints.
For this to work, two identical training runs must produce identical checkpoint bytes.
This experiment validates that assumption and identifies the entropy sources that must be controlled for deterministic training.
The results show that deterministic checkpoints require controlling:
These findings can help inform the design of a deterministic training configuration for the project.
What this experiment tests
The experiment investigates several potential sources of entropy in a simple PyTorch training loop.
Tested factors include:
• torch RNG seeding
• numpy / python RNG seeding
• dataset shuffle order
• dropout randomness
• checkpoint serialization format
• deterministic algorithm mode
Key findings
From the experiments:
• Torch RNG must be seeded – otherwise training is nondeterministic
• Deterministic algorithms must be enabled
• Checkpoint format can introduce entropy (`torch.save` vs deterministic save)
• numpy and python RNG do not affect a pure PyTorch training loop
• Dropout and shuffle remain deterministic when the torch RNG is seeded
Experiment structure
The experiment consists of:
run_experiment.py
A broad sweep that toggles entropy sources and prints checkpoint hashes.
test_entropy_sources.py
Pytest suite validating determinism assumptions.
Documentation:
• WHY.md – motivation and reasoning
• RESULTS.md – experiment output and interpretation
• SOURCES.md – references
Why this matters for OpenVerifiableLLM
The project aims to allow independent verification of model training.
For checkpoint hashing to be a reliable verification mechanism, training must be reproducible under controlled conditions.
This experiment demonstrates:
• which entropy sources matter
• which ones do not
• what must be controlled in the training pipeline
These results can inform the design of a deterministic training configuration for the project.
How to run
```shell
cd experiments/checkpoint_reproducibility
python run_experiment.py
pytest test_entropy_sources.py -v
```
Screenshots/Recordings:
output:
run_experiment.py
test_entropy_sources.py
Additional Notes:
AI Usage Disclosure
AI tools were used as a pair-programming and documentation aid, primarily for brainstorming potential entropy sources and improving documentation clarity.
All experiment design, implementation, debugging, and validation were performed locally by me.
Checklist
Summary by CodeRabbit
New Features
Dependencies
Style