CI robustness: monitor fix, H200 nodes, workspace pre-clean by sbryngelson · Pull Request #1146 · MFlowCode/MFC

sbryngelson · 2026-02-14T14:27:14Z

User description

Summary

Fix bash segfault in monitor_slurm_job.sh caused by fractional read timeout values
Switch Phoenix GPU jobs from L40S/mixed partitions to H200 nodes for faster scheduling
Remove unnecessary NODE_OPTIONS max-old-space-size override from CI workflow
Add workspace pre-clean step to avoid ESTALE errors on NFS-backed self-hosted runners

Test plan

Verify Phoenix GPU jobs schedule on H200 nodes
Verify self-hosted runner jobs clean workspace before checkout
Verify monitor script no longer segfaults on fractional timeout

Split from #1130.

🤖 Generated with Claude Code

Summary by CodeRabbit

Chores
- Updated GPU resource specifications in benchmark and submission workflows to request newer GPU types and increase tasks per node.
- Increased tail-read timeout in job monitoring to reduce polling frequency.
- Improved CI workflow by adding an explicit workspace cleanup step, adjusting checkout behavior, and removing a legacy Node memory setting.

CodeAnt-AI Description

Fix CI monitor crash, use H200 GPUs for Phoenix jobs, and avoid NFS ESTALE on self-hosted runners

What Changed

The Slurm job monitor now uses a 1 s read timeout, so it no longer segfaults, and it drains any remaining job output after completion
Phoenix GPU submissions now request H200 GPUs and increase tasks per node to improve job placement and scheduling
CI workflow removes the Node.js memory override, adds an explicit workspace pre-clean step, and disables checkout's built-in clean to prevent ESTALE errors on NFS-backed self-hosted runners

Impact

✅ Fewer CI monitor crashes during job streaming
✅ Faster scheduling of Phoenix GPU jobs
✅ Fewer ESTALE failures on self-hosted runners

read -t 0.1 (sub-second timeout) in a loop with process substitution file descriptors triggers a bash internal error (unwind_frame_run: read_builtin: frame not found) leading to a segfault. Use integer timeout (read -t 1) instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
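The integer-timeout pattern described in this commit message can be sketched as follows. This is an illustrative loop, not the contents of monitor_slurm_job.sh: `producer` is a hypothetical stand-in for `tail -f` on the job's output file, and the check on the exit status assumes bash 4 or newer, where a timed-out `read` returns a status greater than 128.

```bash
#!/usr/bin/env bash
# Sketch: poll a process-substitution file descriptor with an integer timeout.
# `producer` is a hypothetical stand-in for `tail -f` on a job's output file.
producer() {
    printf 'step 1\nstep 2\n'
    sleep 2                 # quiet period longer than the read timeout
    printf 'step 3\n'
}

exec 3< <(producer)

lines_read=0
idle_polls=0
while true; do
    # Integer timeout (`read -t 1`) instead of the fractional `read -t 0.1`.
    if IFS= read -r -t 1 line <&3; then
        echo "$line"
        lines_read=$((lines_read + 1))
    else
        status=$?
        # Assumes bash >= 4: a timeout yields a status above 128,
        # any other failure here is treated as end of input.
        if [ "$status" -gt 128 ]; then
            idle_polls=$((idle_polls + 1))   # no data yet, stream still open
        else
            break
        fi
    fi
done
exec 3<&-
echo "lines_read=$lines_read idle_polls=$idle_polls"
```

Each timeout is a natural point for status checks or heartbeat updates, at the cost of up to one second of added latency per idle poll.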

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phoenix runners work fine without the max-old-space-size override. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ners The actions/checkout clean step fails with ESTALE errors on NFS-backed storage when build artifacts from previous runs have stale file handles. Pre-clean with rm -rf (which tolerates stale handles) and disable checkout's built-in clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
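The pre-clean command from this commit can be tried safely on a throwaway directory. The sketch below assumes a temporary directory in place of the real workspace; `WORKSPACE` is a hypothetical stand-in for `$GITHUB_WORKSPACE`, and the test files are invented for illustration. It only shows which entries the two globs cover and does not reproduce ESTALE behavior.

```bash
#!/usr/bin/env bash
# Sketch: run the pre-clean globs against a temporary directory.
# WORKSPACE is a hypothetical stand-in for $GITHUB_WORKSPACE.
WORKSPACE=$(mktemp -d)
mkdir -p "$WORKSPACE/build" "$WORKSPACE/.cache"
touch "$WORKSPACE/build/a.o" "$WORKSPACE/.hidden" "$WORKSPACE/README"

# First glob: regular entries. Second glob: dot-entries other than "." and "..".
# Errors are discarded and the exit status is forced to success.
rm -rf "$WORKSPACE"/* "$WORKSPACE"/.[!.]* 2>/dev/null || true

remaining=$(ls -A "$WORKSPACE" | wc -l)
echo "entries left: $remaining"
rmdir "$WORKSPACE"
```

Note that `.[!.]*` skips names that begin with two dots (for example `..foo`), so such entries would survive the clean.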

codeant-ai · 2026-02-14T14:27:18Z

CodeAnt AI is reviewing your PR.

qodo-code-review · 2026-02-14T14:27:36Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Scheduler Config

The GPU `sbatch` options now specify only `--gres=gpu:H200:2` (and increase `--ntasks-per-node`) without an explicit partition/constraint. Validate that Phoenix's Slurm configuration accepts this without needing `-p`/constraints, and that the new `ntasks-per-node` aligns with how the job scripts size MPI ranks/threads so runs don't oversubscribe or underutilize the node.

    sbatch_gpu_opts="\
    #SBATCH --gres=gpu:H200:2
    #SBATCH --ntasks-per-node=8       # Number of cores per node required\
    "

Workspace Cleanup

The pre-checkout `rm -rf` workspace cleanup is aggressive and relies on glob patterns for both normal and dotfiles. Validate it behaves correctly on the self-hosted runners (including when the workspace is empty, contains unexpected dot-directories, or has permission/FS quirks) and doesn't remove anything outside the intended workspace due to symlinks/mount behavior on NFS.

    - name: Clean workspace
      run: rm -rf "$GITHUB_WORKSPACE"/* "$GITHUB_WORKSPACE"/.[!.]* 2>/dev/null || true

    - name: Clone
      uses: actions/checkout@v4
      with:
        clean: false

Behavior Change

Switching `read -t` from fractional values to `1` second changes responsiveness/throughput characteristics of log draining and heartbeat updates. Validate the monitor loop still detects stalled output promptly and doesn't introduce noticeable delays in status checks or job completion detection, especially in high-volume output scenarios.

    while IFS= read -r -t 1 line <&3 2>/dev/null; do
        echo "$line"
        lines_read=$((lines_read + 1))
        last_heartbeat=$(date +%s)

coderabbitai · 2026-02-14T14:27:43Z

No actionable comments were generated in the recent review. 🎉

Walkthrough

Updated CI/CD and SLURM scripts: increased non-blocking tail read timeout in the SLURM monitor, replaced deprecated GPU SBATCH flags with --gres=gpu:H200:2 and increased --ntasks-per-node to 8 in submit scripts, and adjusted GitHub Actions workspace cleanup and NODE_OPTIONS handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| GPU Resource Configuration: `.github/workflows/phoenix/submit-bench.sh`, `.github/workflows/phoenix/submit.sh` | Removed deprecated GPU flags/partition/core directives and group flags; added `--gres=gpu:H200:2` and set `--ntasks-per-node=8`. |
| SLURM Monitoring: `.github/scripts/monitor_slurm_job.sh` | Increased non-blocking read/poll timeout from `0.1`s to `1`s in the main tail-reading loop and final drain loop (polling interval only). |
| GitHub Actions Workflow: `.github/workflows/test.yml` | Removed conditional `NODE_OPTIONS` setting; added a top-level "Clean workspace" step (`rm -rf`) and set `clean: false` on the checkout step. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Shell completion auto-install and pre-commit hook improvements #1124: touches monitor_slurm_job.sh timeouts and GPU SBATCH option updates overlapping these changes.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes: monitor fix, H200 nodes, and workspace pre-clean. |
| Description check | ✅ Passed | The description provides comprehensive coverage of all changes, testing approach, and context, though it deviates from the template structure. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Merge Conflict Detection | ✅ Passed | No merge conflicts detected when merging into `master`. |

cubic-dev-ai

No issues found across 4 files

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

codeant-ai · 2026-02-14T14:29:16Z

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

GPU partition selection / scheduling

The SBATCH GPU options now request `--gres=gpu:H200:2` and increase `--ntasks-per-node` to 8. This may be appropriate for your cluster, but removing explicit partition hints (previously a list of GPU partitions) could make scheduling behavior different on some clusters. Confirm H200 devices exist and that removing partition hints won't cause jobs to be placed on unexpected partitions or wait longer.

tail PID capture reliability

The script uses process substitution with `exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)` and then captures `tail_pid=$!`. Depending on the shell and how process substitution is implemented, `$!` may not reliably be the `tail` PID (it can be the PID of a wrapper/subshell). If the PID isn't the tail process, attempts to kill it later may not stop the actual `tail`, leaving orphaned processes. Validate the PID handling across target runners and consider using a FIFO or launching `tail` explicitly in the background to get a reliable PID.

Read timeout / responsiveness

The non-blocking `read -r -t 1` loop will wait up to 1s when no input is available; this increases latency for status checks and heartbeats compared to the previous short timeout. Verify the new timeout doesn't cause delayed detection of job completion or slow heartbeats in practice (especially for short jobs). Consider whether the 1s timeout vs. the previous fractional timeout and the additional `sleep 1` later in the loop create larger-than-expected polling gaps.
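The FIFO alternative suggested for the tail PID concern could look roughly like the sketch below. It is not taken from monitor_slurm_job.sh: the file names are hypothetical, and it assumes a platform (such as Linux) where opening a FIFO read-write does not block. Because `tail` is launched directly as a background job, `$!` refers to `tail` itself rather than to a process-substitution wrapper.

```bash
#!/usr/bin/env bash
# Sketch: start tail as a background job writing to a FIFO so that $!
# is the PID of tail itself. All file names here are hypothetical.
workdir=$(mktemp -d)
output_file="$workdir/job.out"
fifo="$workdir/tail.fifo"
printf 'line 1\nline 2\n' > "$output_file"
mkfifo "$fifo"

# Assumption: opening the FIFO read-write on fd 3 does not block (true on
# Linux), so neither this shell nor tail stalls waiting for the other end.
exec 3<> "$fifo"
tail -n +1 -f "$output_file" > "$fifo" 2>/dev/null &
tail_pid=$!

lines_read=0
# fd 3 stays open for writing in this shell, so the loop ends on the first
# 1-second timeout rather than on end of file.
while IFS= read -r -t 1 line <&3; do
    echo "$line"
    lines_read=$((lines_read + 1))
done

kill "$tail_pid" 2>/dev/null
wait "$tail_pid" 2>/dev/null
exec 3<&-
rm -rf "$workdir"
echo "lines_read=$lines_read"
```

The trade-off is an extra file on disk to create and remove, in exchange for a PID that can be killed and waited on directly.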

codeant-ai · 2026-02-14T14:30:05Z

CodeAnt AI finished reviewing your PR.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @.github/workflows/test.yml:
- Around line 202-203: The "Clean workspace" step uses rm -rf on
"$GITHUB_WORKSPACE"/* which can expand to /* if GITHUB_WORKSPACE is unset;
update that run command to guard the shell variable by using the parameter
expansion ${GITHUB_WORKSPACE:?} (i.e., replace "$GITHUB_WORKSPACE" with
"${GITHUB_WORKSPACE:?}" in the rm invocation) so the job fails fast instead of
risking deletion of root, preserving the existing redirects and || true
behavior.
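The effect of the suggested `${GITHUB_WORKSPACE:?}` guard can be checked in isolation. The sketch below is a standalone demonstration rather than the workflow step itself; it never calls `rm`, and it runs the guarded expansion in a subshell so that the failure does not end the script.

```bash
#!/usr/bin/env bash
# Sketch: what ${VAR:?} changes when the variable is unset. No files are
# touched; the paths are only expanded and printed.
unset GITHUB_WORKSPACE

# Without the guard, the unset variable expands to nothing and the path
# collapses to "/*".
unguarded=$(printf '%s' "$GITHUB_WORKSPACE/*")
echo "unguarded path: $unguarded"

# With the guard, the expansion itself fails and a non-interactive shell
# exits, so the command never runs. The subshell contains that exit.
if ( : "${GITHUB_WORKSPACE:?}" ) 2>/dev/null; then
    guard_result="ran"
else
    guard_result="aborted"
fi
echo "guarded expansion: $guard_result"
```

Because the failed expansion exits the shell before `rm` is invoked, a trailing `|| true` does not mask it, which is what lets the step fail fast.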

- Use ${GITHUB_WORKSPACE:?} to fail fast if variable is unset
- Fix ntasks-per-node comment to say "tasks (MPI ranks)" not "cores"
- Fix monitor script comment: polling-based, not non-blocking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-02-14T17:23:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.07%. Comparing base (4c52155) to head (6ae3e4c).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1146   +/-   ##
=======================================
  Coverage   44.07%   44.07%           
=======================================
  Files          70       70           
  Lines       20431    20431           
  Branches     1974     1974           
=======================================
  Hits         9004     9004           
  Misses      10291    10291           
  Partials     1136     1136

codeant-ai · 2026-02-15T00:25:35Z

CodeAnt AI is running Incremental review

codeant-ai · 2026-02-15T00:26:35Z

CodeAnt AI Incremental review completed.

codeant-ai · 2026-02-15T20:59:28Z

CodeAnt AI is running Incremental review

codeant-ai · 2026-02-15T21:00:29Z

CodeAnt AI Incremental review completed.

sbryngelson and others added 4 commits February 14, 2026 09:25

Switch Phoenix GPU jobs to H200 nodes for faster scheduling

1ac123c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove NODE_OPTIONS from CI workflow

17bdcc8

Phoenix runners work fine without the max-old-space-size override. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings February 14, 2026 14:27

sbryngelson mentioned this pull request Feb 14, 2026

Concurrent Frontier CI: parallel test+bench, consolidated SLURM jobs #1147

Draft

5 tasks

qodo-code-review bot added the Review effort 2/5 label Feb 14, 2026

sbryngelson mentioned this pull request Feb 14, 2026

Concurrent Frontier CI + cancel bench if test fail #1130

Closed

4 tasks

Copilot started reviewing on behalf of sbryngelson February 14, 2026 14:27

codeant-ai bot added the size:S This PR changes 10-29 lines, ignoring generated files label Feb 14, 2026

qodo-code-review bot reviewed Feb 14, 2026

cubic-dev-ai bot reviewed Feb 14, 2026

coderabbitai bot reviewed Feb 14, 2026

qodo-code-review bot reviewed Feb 14, 2026

Merge branch 'master' into ci-improvements

e9e34b0

codeant-ai bot added size:S This PR changes 10-29 lines, ignoring generated files and removed size:S This PR changes 10-29 lines, ignoring generated files labels Feb 15, 2026

qodo-code-review bot reviewed Feb 15, 2026

Merge branch 'master' into ci-improvements

8590a4b

codeant-ai bot added size:S This PR changes 10-29 lines, ignoring generated files and removed size:S This PR changes 10-29 lines, ignoring generated files labels Feb 15, 2026

qodo-code-review bot reviewed Feb 15, 2026

Merge branch 'master' into ci-improvements

6ae3e4c

qodo-code-review bot reviewed Feb 15, 2026

sbryngelson closed this Feb 16, 2026
