
Harden CI for self-hosted SLURM runners #1135

Open
sbryngelson wants to merge 29 commits into MFlowCode:master from sbryngelson:ci-fixes

Conversation


@sbryngelson sbryngelson commented Feb 12, 2026

User description

Summary

Systematic hardening of CI infrastructure for self-hosted runners (Phoenix, Frontier) on Lustre filesystems:

  • SLURM monitor rewrite — Replace fragile squeue-exit-code polling with a proper state machine using get_job_state() (squeue primary, sacct fallback). Adds orphan job cleanup via EXIT trap, terminal state detection, and informative heartbeat messages
  • Fix bash segfault — Change fractional read -t 0.1 to integer read -t 1 to avoid process substitution FD corruption
  • Auto-retry sporadic test failures — Record failed test UUIDs to file, retry up to 5 failures automatically in CI. Catastrophic failures (segfaults, missing output) still propagate immediately
  • Lustre-safe workspace cleanup — Add retry-based rm -rf before checkout with clean: false to avoid ESTALE/ENOTEMPTY errors from git clean on Lustre
  • Robust sacct parsing — Use -X -P flags for parsable, job-level-only output
  • Guard pipelines against pipefail — Append || true to squeue/sacct command substitutions
  • Lustre-safe directory deletion — Retry shutil.rmtree up to 5 times with backoff in Python toolchain
  • Benchmark concurrency fix — Split concurrency group by event type so pull_request_review doesn't cancel pull_request runs
  • Restore benchmark triggers — Revert from workflow_run back to direct pull_request + pull_request_review triggers
  • Fix doc lint — Add build-time generated page IDs and support hyphens in @page/@ref patterns
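The state-machine approach described above (squeue primary, sacct fallback, pipelines guarded against pipefail) can be sketched roughly as follows. This is an illustrative sketch under the stated design, not the PR's actual monitor script; function names and flag choices follow the bullets above:

```shell
# Illustrative sketch of the state-machine monitor: squeue is queried
# first, sacct is the fallback, and "|| true" keeps set -o pipefail from
# aborting the monitor on a transient SLURM outage.
set -euo pipefail

get_job_state() {
    local job_id="$1" state
    state=$(squeue -j "$job_id" -h -o '%T' 2>/dev/null || true)
    if [ -z "$state" ]; then
        # Job left the queue; ask accounting. -X = job-level only, -P = parsable.
        state=$(sacct -j "$job_id" -X -P -n -o State 2>/dev/null | head -n1 || true)
    fi
    echo "${state:-UNKNOWN}"
}

is_terminal_state() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED*|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED) return 0 ;;
        *) return 1 ;;
    esac
}
```

A monitor loop would then poll get_job_state and exit once is_terminal_state reports true, with the CI-level timeout as the backstop for persistent UNKNOWN states.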

Test plan

  • All lint/formatting checks pass
  • GitHub-hosted test matrix passes
  • Phoenix GPU tests (cpu, gpu-acc, gpu-omp) pass
  • Phoenix NVHPC tests pass
  • Frontier tests pass
  • Lustre cleanup step visible in self-hosted job logs

🤖 Generated with Claude Code


CodeAnt-AI Description

Harden CI for self-hosted SLURM runners, add test sharding and automatic test retries

What Changed

  • Tests can be sharded (e.g., --shard 1/2) so GPU test suites run split across CI jobs; the shard option is exposed to mfc test and propagated through submit scripts and workflow matrix.
  • CI now records failed test UUIDs and, when 1–5 tests fail, automatically retries only those tests; large failures still fail the job immediately.
  • SLURM monitor script now detects job state via squeue with sacct fallback, treats terminal states properly, prints clearer heartbeat messages with job state, cancels the SLURM job if the monitor exits abnormally, and avoids a bash read timeout bug by using an integer timeout.
  • Directory removals in the Python toolchain retry on transient filesystem errors to avoid ESTALE/ENOTEMPTY failures.
  • Benchmark workflow triggers and concurrency behavior restored to support approve-to-run flow for PRs and avoid canceling unrelated runs; build/test job scripts adjusted (shorter SLURM walltimes, different partitions/QoS, auto-requeue on preemption) to improve scheduling behavior.
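The Lustre-safe directory removal could look roughly like the following minimal sketch; the function name, retry count, and backoff constants are illustrative assumptions rather than the toolchain's actual code:

```python
import errno
import shutil
import time


def rmtree_with_retries(path, attempts=5, backoff=1.0):
    """Remove a directory tree, retrying transient Lustre errors
    (ESTALE, ENOTEMPTY) with a growing delay between attempts."""
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except FileNotFoundError:
            return  # already gone, nothing to do
        except OSError as e:
            # Re-raise immediately for non-transient errors or on the last try.
            if e.errno not in (errno.ESTALE, errno.ENOTEMPTY) or attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))
```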

Impact

✅ Fewer CI flakes due to targeted automatic test retries
✅ Fewer orphaned SLURM jobs when monitor exits unexpectedly
✅ Faster/safer GPU test runs by splitting suites across runners

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

sbryngelson and others added 2 commits February 12, 2026 11:51
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
read -t 0.1 (sub-second timeout) in a loop with process substitution
file descriptors triggers a bash internal error (unwind_frame_run:
read_builtin: frame not found) leading to a segfault. Use integer
timeout (read -t 1) instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
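The failure mode involves read -t with a fractional timeout on a process-substitution file descriptor; the sketch below shows the integer-timeout pattern the fix adopts (variable names are illustrative):

```shell
# Read from a process-substitution FD with an integer timeout. Fractional
# timeouts (read -t 0.1) on such FDs triggered the bash "frame not found"
# internal error described above.
exec 3< <(printf 'line one\nline two\n'; sleep 2)
lines=()
while IFS= read -r -t 1 -u 3 line; do   # integer timeout: safe
    lines+=("$line")
done
exec 3<&-
printf '%s\n' "${lines[@]}"
```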
Copilot AI review requested due to automatic review settings February 12, 2026 16:51

codeant-ai bot commented Feb 12, 2026

CodeAnt AI is reviewing your PR.



@qodo-code-review

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

In sbatch_gpu_opts, the new #SBATCH --gres=gpu:H200:2 line lacks the trailing \ used to build the multi-line string. This can truncate the intended sbatch_gpu_opts contents (e.g., dropping the following --ntasks-per-node line) depending on how the string is later expanded/used, and can lead to missing SBATCH directives at submission time.

sbatch_gpu_opts="\
#SBATCH --gres=gpu:H200:2
#SBATCH --ntasks-per-node=8       # Number of cores per node required\
"
Possible Issue

Same potential string-construction issue in sbatch_gpu_opts: the added #SBATCH --gres=gpu:H200:2 line does not end with \, which may prevent subsequent SBATCH lines from being included in the generated submission script content.

sbatch_gpu_opts="\
#SBATCH --gres=gpu:H200:2
#SBATCH --ntasks-per-node=8       # Number of cores per node required\
"
Scheduling Risk

Switching from an explicit GPU partition list to only --gres=gpu:H200:2 may fail or schedule unexpectedly on systems where GPU type is partitioned/controlled via -p and/or constraints rather than GRES alone. Confirm the Phoenix Slurm configuration supports selecting H200 nodes purely via GRES and that no required partition/QoS/account flags were implicitly provided before.

sbatch_gpu_opts="\
#SBATCH --gres=gpu:H200:2
#SBATCH --ntasks-per-node=8       # Number of cores per node required\
"

@codeant-ai codeant-ai bot added the size:XS label (this PR changes 0-9 lines, ignoring generated files) on Feb 12, 2026

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 3 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.


codeant-ai bot commented Feb 12, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • GPU resource & task density change
    The GPU reservation changed to request H200 GPUs and --ntasks-per-node was increased to 8 (from 4).
    This can affect scheduling, partition compatibility and CPU/GPU oversubscription on some clusters.
    Confirm the cluster supports --gres=gpu:H200:2 and that 8 tasks per node is appropriate for the target node type.

  • Read timeout change
    The non-blocking read -t timeouts were changed from fractional values (0.1 / 0.5) to 1 second integers.
    This increases the maximum blocking time of the read operation and therefore can increase latency of job-status
    checks and heartbeats, and may slow detection of transient squeue failures. Consider making the timeout configurable
or reducing it if faster responsiveness is required.

  • Drain loop blocking delay
    The drain loop now uses read -t 1 and can wait up to 1s for the final read attempt before exiting.
    That can add up to ~1s delay on drain completion. Verify this is acceptable and that the tail process is reliably
    terminated after draining (tail_pid handling).


codeant-ai bot commented Feb 12, 2026

CodeAnt AI finished reviewing your PR.


coderabbitai bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Refactors SLURM monitoring to use state-driven polling, improves exit-code extraction and tailing behavior, adds cancellation on abnormal exit; updates GPU SBATCH requests to --gres=gpu:H200:2 with --ntasks-per-node=8; changes bench workflow triggers to PR events; adds CI retry for up to 5 failed tests and persists failed test UUIDs.

Changes

  • SLURM monitor script (.github/scripts/monitor_slurm_job.sh): replaces squeue-retry/backoff with a state-driven loop (get_job_state, is_terminal_state), adds scontrol/sacct fallbacks for ExitCode, cancels the job on abnormal exit, increases the tail read timeout (0.1s to 1s) with a burst-read cap, and drains output with stabilization checks and a capped drain.
  • GPU SBATCH submit scripts (.github/workflows/phoenix/submit-bench.sh, .github/workflows/phoenix/submit.sh): activates GPU resource requests, replacing prior partition/constraint flags with --gres=gpu:H200:2 and updating --ntasks-per-node=8, adjusting GPU and per-node task configuration.
  • CI bench workflow (.github/workflows/bench.yml): switches the trigger from workflow_run to PR events (pull_request, pull_request_review), removes workflow_run-dependent PR-info steps, and simplifies concurrency/gating to use PR refs and review-based conditions.
  • CI test retry & failure persistence (.github/workflows/test.yml, toolchain/mfc/test/test.py): adds a post-run retry that reruns up to 5 failed tests using failed_uuids.txt; the test harness now writes/removes failed_uuids.txt to reflect current failures for selective retries.

Sequence Diagram(s)

sequenceDiagram
    participant Monitor as monitor_slurm_job.sh
    participant SlurmCtl as squeue/sacct/scontrol
    participant JobFS as Job output file (filesystem)
    participant Tail as tail process

    Monitor->>SlurmCtl: get_job_state(job_id) (squeue → sacct fallback)
    SlurmCtl-->>Monitor: STATE (PENDING/RUNNING/COMPLETED/UNKNOWN)
    alt STATE PENDING/CONFIGURING
        Monitor->>Monitor: sleep (~10s), continue polling
    else STATE RUNNING/COMPLETING
        Monitor->>Tail: start non-blocking tail (1s timeout, burst cap)
        Tail->>JobFS: read new lines
        JobFS-->>Tail: new output
        Tail-->>Monitor: heartbeat + output
    else STATE TERMINAL
        Monitor->>Tail: stop tail, drain remaining lines (1s timeout, cap)
        Monitor->>SlurmCtl: scontrol show job -> ExitCode (fallback sacct)
        SlurmCtl-->>Monitor: ExitCode or UNKNOWN
        alt ExitCode == 0
            Monitor-->>Monitor: set monitor_success, exit 0
        else
            Monitor->>SlurmCtl: scancel job_id (if needed)
            Monitor-->>Monitor: exit non-zero
        end
    else STATE UNKNOWN
        Monitor->>Monitor: periodic warning, longer sleep, retry
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

size:L

Poem

🐇 I hop through logs with whiskers bright and keen,
squeue and sacct whisper states in fields of green.
Two H200s hum, eight tasks snug on each node,
I stash tiny UUIDs down my burrowed road.
Thump — the monitor settles, output safe and serene.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Merge Conflict Detection: ⚠️ Warning. Merge conflicts detected (16 files):

⚔️ .github/scripts/monitor_slurm_job.sh (content)
⚔️ .github/workflows/bench.yml (content)
⚔️ .github/workflows/phoenix/submit-bench.sh (content)
⚔️ .github/workflows/phoenix/submit.sh (content)
⚔️ .github/workflows/test.yml (content)
⚔️ .pr_agent.toml (content)
⚔️ .typos.toml (content)
⚔️ CMakeLists.txt (content)
⚔️ docs/Doxyfile.in (content)
⚔️ docs/documentation/case.md (content)
⚔️ docs/documentation/readme.md (content)
⚔️ docs/documentation/references.md (content)
⚔️ docs/documentation/visualization.md (content)
⚔️ toolchain/mfc/lint_docs.py (content)
⚔️ toolchain/mfc/params/definitions.py (content)
⚔️ toolchain/mfc/test/test.py (content)

These conflicts must be resolved before merging into master.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)
  • Docstring Coverage: passed. Docstring coverage is 100.00%, which meets the required 80.00% threshold.
  • Title check: passed. The title 'Harden CI for self-hosted SLURM runners' clearly and concisely describes the primary objective of the pull request: improving the robustness of CI infrastructure for self-hosted runners on SLURM systems.
  • Description check: passed. The PR description covers all required sections: a detailed summary of changes, motivation (CI hardening), multiple test confirmations, and completed checklist items.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


Copilot AI left a comment


Pull request overview

Updates the Phoenix (GT) GitHub Actions SLURM submission scripts to target H200 GPU resources for improved scheduling, and adjusts the SLURM job monitor script to avoid a Bash segfault caused by fractional read -t timeouts.

Changes:

  • Update Phoenix GPU sbatch directives to request H200 GPUs and adjust task count per node.
  • Replace fractional read -t timeouts with integer timeouts in the SLURM job monitor script.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
.github/workflows/phoenix/submit.sh Switch GPU submission options to request H200 GPUs and increase --ntasks-per-node.
.github/workflows/phoenix/submit-bench.sh Same as above for benchmark submissions.
.github/scripts/monitor_slurm_job.sh Use integer read -t timeouts to prevent Bash segfaults during output streaming.

PR MFlowCode#1124 changed bench.yml to use workflow_run (triggered after Test
Suite completes), which broke the approve-to-run flow for fork PRs.
Revert to the original pull_request + pull_request_review triggers
while keeping improvements (frontier_amd matrix, concurrency group,
timeout, run_parallel_benchmarks.sh).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
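The restored triggers and event-split concurrency described above might look roughly like this fragment (illustrative only; the real workflow's group key and trigger details may differ):

```yaml
on:
  pull_request:
  pull_request_review:
    types: [submitted]

# Including the event name in the group key keeps a pull_request_review
# run from cancelling an in-flight pull_request run for the same PR.
concurrency:
  group: bench-${{ github.event_name }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true
```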

codecov bot commented Feb 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.07%. Comparing base (4c52155) to head (475caa3).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1135   +/-   ##
=======================================
  Coverage   44.07%   44.07%           
=======================================
  Files          70       70           
  Lines       20431    20431           
  Branches     1974     1974           
=======================================
  Hits         9004     9004           
  Misses      10291    10291           
  Partials     1136     1136           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Write failed test UUIDs to tests/failed_uuids.txt after a test run.
In CI, if 1-5 tests fail, automatically re-run just those tests.
If 6+ fail, treat it as a real issue and fail immediately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
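The CI-side gate this commit describes might be sketched as below; the file path, threshold, and function name are illustrative assumptions:

```shell
# Sketch of the retry gate: rerun only when the harness completed and
# recorded between 1 and 5 failing test UUIDs; larger failure counts are
# treated as real breakage rather than flakes.
retry_gate() {
    local file="$1" max=5 num
    [ -f "$file" ] || { echo "no-retry"; return; }
    num=$(wc -l < "$file")
    if [ "$num" -ge 1 ] && [ "$num" -le "$max" ]; then
        echo "retry $(tr '\n' ' ' < "$file")"
    else
        echo "fail"
    fi
}
```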

codeant-ai bot commented Feb 12, 2026

CodeAnt AI is running Incremental review



@codeant-ai codeant-ai bot added the size:M label and removed the size:XS label (line counts ignore generated files) on Feb 12, 2026

codeant-ai bot commented Feb 12, 2026

CodeAnt AI Incremental review completed.


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/workflows/test.yml">

<violation number="1" location=".github/workflows/test.yml:138">
P2: The initial test run is forced to succeed with `|| true`, so real failures can be masked when `tests/failed_uuids.txt` is missing. Preserve the exit code and fail the job if no retry is performed.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 137-153: The test step currently uses "|| true" after invoking
mfc.sh which hides crashes; remove that and capture the test command's exit
status (run mfc.sh test -v ... and set TEST_EXIT=$? or use if ! ...; then
TEST_EXIT=$? else TEST_EXIT=0) so you can decide later; keep the existing retry
logic that checks tests/failed_uuids.txt (NUM_FAILED, FAILED) and, after that
block, if tests/failed_uuids.txt does not exist and TEST_EXIT is non‑zero then
exit with TEST_EXIT (or otherwise propagate the original non‑zero exit code) to
avoid reporting success on a crash. Ensure you reference the same mfc.sh
invocation and tests/failed_uuids.txt and use TEST_EXIT when making the final
exit decision.
🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (1)

209-216: Race condition: failed_tests is mutated from worker threads without synchronization.

failed_tests is appended to in handle_case (line 505) which runs in worker threads. While CPython's GIL makes list.append atomic, iterating over failed_tests here (line 213) is safe only because sched.sched (line 179) has already joined all workers by this point. This is fine but fragile — a future refactor that moves this block could introduce a bug. A brief comment would help.

.github/workflows/test.yml (1)

144-144: Minor: useless use of cat.

tr '\n' ' ' < tests/failed_uuids.txt is slightly cleaner, though this is purely cosmetic.

sbryngelson and others added 2 commits February 12, 2026 17:40
Don't mask non-zero exit codes when tests crash before writing
failed_uuids.txt. Only suppress the exit code when the file exists
(meaning the test framework ran to completion and we can retry).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
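A minimal sketch of the exit-code handling this commit describes, with hypothetical names for the command and UUID file:

```shell
# Run the tests without masking crashes: suppress the exit code only when
# failed_uuids.txt exists (the harness ran to completion and a targeted
# retry is possible); otherwise propagate the original failure.
run_tests() {   # $1 = test command, $2 = uuid file (names are illustrative)
    local test_exit=0
    $1 || test_exit=$?
    if [ "$test_exit" -ne 0 ] && [ ! -f "$2" ]; then
        return "$test_exit"
    fi
    return 0
}
```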
Replace squeue exit-code polling with get_job_state() that parses
the actual state string (squeue + sacct fallback). Never give up on
UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM
jobs on abnormal monitor exit. Include job state in heartbeats.

Incorporates changes from PR MFlowCode#1140.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name=".github/scripts/monitor_slurm_job.sh">

<violation number="1" location=".github/scripts/monitor_slurm_job.sh:40">
P2: Guard the `squeue` pipeline so transient command failures don't abort the script under `set -euo pipefail`; otherwise a temporary SLURM outage exits the monitor instead of returning "UNKNOWN".</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Without || exit $?, a failed retry would silently exit 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codeant-ai bot commented Feb 16, 2026

CodeAnt AI is running Incremental review



@codeant-ai codeant-ai bot added and removed the size:L label (100-499 lines changed, ignoring generated files) on Feb 16, 2026

codeant-ai bot commented Feb 16, 2026

CodeAnt AI Incremental review completed.

Revert test.yml to clean: true (default) — the corrupted build cache
from the ioctl failure was causing 100% test failures. The Lustre-safe
cleanup is only needed for bench.yml where pr/master are separate trees.

Also tune qodo PR reviewer: reduce max findings to 5, lower suggestion
depth to medium, and add instructions to focus on correctness over style
for CI scripts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Lustre-safe cleanup step was wiping the build cache (pr/build/,
master/build/), forcing full rebuilds every run. This added ~32 min of
build time and pushed NVHPC gpu-omp benchmarks past the 4h SLURM limit.
Restore default checkout behavior to preserve build cache across runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codeant-ai bot commented Feb 16, 2026

CodeAnt AI is running Incremental review



@codeant-ai codeant-ai bot added and removed the size:L label (100-499 lines changed, ignoring generated files) on Feb 16, 2026

codeant-ai bot commented Feb 16, 2026

CodeAnt AI Incremental review completed.

Split Frontier GPU test configs into 2 shards (~75 min each) so they
fit within the batch partition's 2h wall time limit. This allows all
Frontier SLURM jobs to run concurrently instead of serially on the
extended partition (which has a 1-job-per-user limit), reducing total
CI wall clock from ~4.5h to ~2h.

Changes:
- Add --shard CLI argument (e.g., --shard 1/2) with modulo-based
  round-robin distribution across shards
- Switch Frontier submit scripts from extended to batch/hackathon
  (CFD154 account, 1h59m wall time)
- Shard the 3 Frontier GPU matrix entries into 6 (2 shards each)
- CPU entries remain unsharded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
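The modulo-based round-robin distribution could be sketched as follows (illustrative, not the toolchain's actual code; it also validates the shard spec, guarding against 0 or out-of-range values):

```python
def parse_shard(spec):
    """Parse a --shard spec like '1/2' into (index, count), validating it."""
    idx_s, _, cnt_s = spec.partition("/")
    idx, cnt = int(idx_s), int(cnt_s)
    if cnt < 1 or not (1 <= idx <= cnt):
        raise ValueError(f"invalid shard spec: {spec!r}")
    return idx, cnt


def select_shard(cases, spec):
    """Round-robin: case i goes to shard (i % count) + 1."""
    idx, cnt = parse_shard(spec)
    return [c for i, c in enumerate(cases) if i % cnt == idx - 1]
```

With two shards, even-indexed cases land in shard 1/2 and odd-indexed cases in shard 2/2, so long and short cases interleave roughly evenly across shards.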

coderabbitai bot commented Feb 16, 2026

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

Error details
PATCH https://api.github.com/repos/MFlowCode/MFC/issues/comments/3892112835 returned HTTP 401 (Bad credentials).


@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 7 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="toolchain/mfc/test/test.py">

<violation number="1" location="toolchain/mfc/test/test.py:103">
P2: Validate the `--shard` argument before using it; as written, invalid or `0` shard counts will raise exceptions (including ZeroDivisionError) during modulo operations.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

sbryngelson and others added 2 commits February 16, 2026 11:19
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move build retry logic from shell scripts to GHA using nick-fields/retry
with 60s backoff between attempts. This gives better visibility into
retries and lets login node memory pressure subside between attempts.

Also reduce build parallelism from -j 8 to -j 4 to lower peak memory
on shared Frontier login nodes, and remove the outdated Node 16 version
overrides from self-hosted runner env.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
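An illustrative step using nick-fields/retry; note the action requires a timeout_minutes or timeout_seconds input, so one is included here. The command and values are assumptions, not the workflow's exact contents:

```yaml
- name: Build
  uses: nick-fields/retry@v3
  with:
    max_attempts: 3
    retry_wait_seconds: 60   # let login-node memory pressure subside
    timeout_minutes: 90      # required by the action
    command: ./mfc.sh build -j 4
```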

codeant-ai bot commented Feb 16, 2026

CodeAnt AI is running Incremental review



@codeant-ai codeant-ai bot added and removed the size:L label (100-499 lines changed, ignoring generated files) on Feb 16, 2026

codeant-ai bot commented Feb 16, 2026

CodeAnt AI Incremental review completed.

Without set -e, the benchmark build loop could silently ignore failures
of earlier benchmarks if a later one succeeded, since only the last
command's exit code would propagate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
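The fail-fast behavior this commit restores can be sketched generically (an illustrative helper: under set -e each command's failure aborts the loop immediately, and the explicit check below achieves the same without set -e):

```shell
run_all() {
    # Run each command in order; report and stop at the first failure so a
    # later success cannot mask it.
    local cmd
    for cmd in "$@"; do
        if ! $cmd; then
            echo "failed: $cmd" >&2
            return 1
        fi
    done
}
```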

@cubic-dev-ai cubic-dev-ai bot left a comment


3 issues found across 4 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".github/workflows/frontier/build.sh">

<violation number="1" location=".github/workflows/frontier/build.sh:23">
P2: Benchmark loop no longer checks for errors, so a failing benchmark can be masked by later successful runs (script has no `set -e`). Add explicit failure handling to stop on the first failing benchmark.</violation>
</file>

<file name=".github/workflows/frontier_amd/build.sh">

<violation number="1" location=".github/workflows/frontier_amd/build.sh:23">
P2: Failures in benchmark runs can be silently ignored because the loop doesn't check each `./mfc.sh run` exit code (the script doesn't use `set -e`). A failing benchmark earlier in the loop can be masked by a later successful run, letting CI pass incorrectly. Consider exiting non-zero as soon as a benchmark run fails.</violation>
</file>

<file name=".github/workflows/bench.yml">

<violation number="1" location=".github/workflows/bench.yml:106">
P2: `nick-fields/retry@v3` requires `timeout_minutes` or `timeout_seconds`. Without one, the step will fail before running the build command. Add a timeout input to keep the retry step valid.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

```diff
-exit 1
+if [ "$run_bench" == "bench" ]; then
+    for dir in benchmarks/*/; do
+        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts
```

P2: Benchmark loop no longer checks for errors, so a failing benchmark can be masked by later successful runs (script has no set -e). Add explicit failure handling to stop on the first failing benchmark.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/frontier/build.sh, line 23:

<comment>Benchmark loop no longer checks for errors, so a failing benchmark can be masked by later successful runs (script has no `set -e`). Add explicit failure handling to stop on the first failing benchmark.</comment>

<file context>
@@ -18,39 +18,10 @@ fi
-exit 1
+if [ "$run_bench" == "bench" ]; then
+    for dir in benchmarks/*/; do
+        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts
+    done
+else
</file context>
Suggested change
```diff
-        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts
+        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts || exit $?
```

```diff
-exit 1
+if [ "$run_bench" == "bench" ]; then
+    for dir in benchmarks/*/; do
+        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts
```

P2: Failures in benchmark runs can be silently ignored because the loop doesn't check each ./mfc.sh run exit code (the script doesn't use set -e). A failing benchmark earlier in the loop can be masked by a later successful run, letting CI pass incorrectly. Consider exiting non-zero as soon as a benchmark run fails.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/frontier_amd/build.sh, line 23:

<comment>Failures in benchmark runs can be silently ignored because the loop doesn't check each `./mfc.sh run` exit code (the script doesn't use `set -e`). A failing benchmark earlier in the loop can be masked by a later successful run, letting CI pass incorrectly. Consider exiting non-zero as soon as a benchmark run fails.</comment>

<file context>
@@ -18,39 +18,10 @@ fi
-exit 1
+if [ "$run_bench" == "bench" ]; then
+    for dir in benchmarks/*/; do
+        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts
+    done
+else
</file context>
Suggested change
```diff
-        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts
+        ./mfc.sh run -v "$dir/case.py" --case-optimization -j 4 --dry-run $build_opts || exit 1
```

```diff
-          wait %1 && wait %2
+        uses: nick-fields/retry@v3
+        with:
+          max_attempts: 3
```

P2: nick-fields/retry@v3 requires timeout_minutes or timeout_seconds. Without one, the step will fail before running the build command. Add a timeout input to keep the retry step valid.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .github/workflows/bench.yml, line 106:

<comment>`nick-fields/retry@v3` requires `timeout_minutes` or `timeout_seconds`. Without one, the step will fail before running the build command. Add a timeout input to keep the retry step valid.</comment>

<file context>
@@ -104,10 +101,14 @@ jobs:
-          wait %1 && wait %2
+        uses: nick-fields/retry@v3
+        with:
+          max_attempts: 3
+          retry_wait_seconds: 60
+          command: |
</file context>
Suggested change
```diff
-          max_attempts: 3
+          timeout_minutes: 480
+          max_attempts: 3
```

Comment on lines 73 to 104
```diff
 while [ ! -f "$output_file" ]; do
-    # Check if job is still queued/running
-    if squeue -j "$job_id" &>/dev/null; then
-        squeue_retries=0 # Reset on success
-        sleep 5
-    else
-        squeue_retries=$((squeue_retries + 1))
-        if [ $squeue_retries -ge $max_squeue_retries ]; then
-            # Job not in queue and output file doesn't exist
-            if [ ! -f "$output_file" ]; then
-                echo "ERROR: Job $job_id not in queue and output file not created"
+    state=$(get_job_state "$job_id")
+
+    case "$state" in
+        PENDING|CONFIGURING)
+            unknown_count=0
+            sleep 5
+            ;;
+        RUNNING|COMPLETING)
+            unknown_count=0
+            # Job is running but output file not yet visible (NFS delay)
+            sleep 2
+            ;;
+        UNKNOWN)
+            unknown_count=$((unknown_count + 1))
+            # Only print warning periodically to avoid log spam
+            if [ $((unknown_count % 12)) -eq 1 ]; then
+                echo "Warning: Could not query job $job_id state (SLURM may be temporarily unavailable)..."
+            fi
+            sleep 5
+            ;;
+        *)
+            # Terminal state — job finished without creating output
+            if is_terminal_state "$state"; then
+                echo "ERROR: Job $job_id reached terminal state ($state) without creating output file"
                 exit 1
             fi
-            break
-        fi
-        # Exponential backoff
-        sleep_time=$((2 ** squeue_retries))
-        echo "Warning: squeue check failed, retrying in ${sleep_time}s..."
-        sleep $sleep_time
-    fi
+            # Unrecognized state, keep waiting
+            sleep 5
+            ;;
+    esac
 done
```

Suggestion: Add a grace period after a job reaches a terminal state to wait for the output file to appear, preventing premature failures caused by filesystem delays. [possible issue, importance: 8]

Suggested change (replacing the wait loop quoted above):

```bash
terminal_grace_seconds=45
terminal_seen_at=0
while [ ! -f "$output_file" ]; do
    state=$(get_job_state "$job_id")

    case "$state" in
        PENDING|CONFIGURING)
            unknown_count=0
            terminal_seen_at=0
            sleep 5
            ;;
        RUNNING|COMPLETING)
            unknown_count=0
            terminal_seen_at=0
            # Job is running but output file not yet visible (NFS delay)
            sleep 2
            ;;
        UNKNOWN)
            unknown_count=$((unknown_count + 1))
            terminal_seen_at=0
            # Only print warning periodically to avoid log spam
            if [ $((unknown_count % 12)) -eq 1 ]; then
                echo "Warning: Could not query job $job_id state (SLURM may be temporarily unavailable)..."
            fi
            sleep 5
            ;;
        *)
            if is_terminal_state "$state"; then
                now=$(date +%s)
                if [ "$terminal_seen_at" -eq 0 ]; then
                    terminal_seen_at=$now
                fi
                if [ $((now - terminal_seen_at)) -lt "$terminal_grace_seconds" ]; then
                    # Give NFS/FS a chance to publish the output file
                    sleep 2
                    continue
                fi
                echo "ERROR: Job $job_id reached terminal state ($state) without creating output file after ${terminal_grace_seconds}s grace"
                exit 1
            fi
            terminal_seen_at=0
            # Unrecognized state, keep waiting
            sleep 5
            ;;
    esac
done
```

Labels: Review effort 3/5 · size:L (This PR changes 100-499 lines, ignoring generated files)
