Skip to content

CI robustness: monitor fix, H200 nodes, workspace pre-clean#1146

Closed
sbryngelson wants to merge 8 commits intoMFlowCode:masterfrom
sbryngelson:ci-improvements
Closed

CI robustness: monitor fix, H200 nodes, workspace pre-clean#1146
sbryngelson wants to merge 8 commits intoMFlowCode:masterfrom
sbryngelson:ci-improvements

Conversation

@sbryngelson
Copy link
Member

@sbryngelson sbryngelson commented Feb 14, 2026

User description

Summary

  • Fix bash segfault in monitor_slurm_job.sh caused by fractional read timeout values
  • Switch Phoenix GPU jobs from L40S/mixed partitions to H200 nodes for faster scheduling
  • Remove unnecessary NODE_OPTIONS max-old-space-size override from CI workflow
  • Add workspace pre-clean step to avoid ESTALE errors on NFS-backed self-hosted runners

Test plan

  • Verify Phoenix GPU jobs schedule on H200 nodes
  • Verify self-hosted runner jobs clean workspace before checkout
  • Verify monitor script no longer segfaults on fractional timeout

Split from #1130.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Updated GPU resource specifications in benchmark and submission workflows to request newer GPU types and increase tasks per node.
    • Increased tail-read timeout in job monitoring to reduce polling frequency.
    • Improved CI workflow by adding an explicit workspace cleanup step, adjusting checkout behavior, and removing a legacy Node memory setting.

CodeAnt-AI Description

Fix CI monitor crash, use H200 GPUs for Phoenix jobs, and avoid NFS ESTALE on self-hosted runners

What Changed

  • The Slurm job monitor uses a 1s read timeout so the monitor no longer segfaults and drains remaining job output after completion
  • Phoenix GPU submissions now request H200 GPUs and increase tasks per node to improve job placement and scheduling
  • CI workflow removes the Node.js memory override, adds an explicit workspace pre-clean step, and disables checkout's built-in clean to prevent ESTALE errors on NFS-backed self-hosted runners

Impact

✅ Fewer CI monitor crashes during job streaming
✅ Faster scheduling of Phoenix GPU jobs
✅ Fewer ESTALE failures on self-hosted runners

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

sbryngelson and others added 4 commits February 14, 2026 09:25
read -t 0.1 (sub-second timeout) in a loop with process substitution
file descriptors triggers a bash internal error (unwind_frame_run:
read_builtin: frame not found) leading to a segfault. Use integer
timeout (read -t 1) instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phoenix runners work fine without the max-old-space-size override.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ners

The actions/checkout clean step fails with ESTALE errors on NFS-backed
storage when build artifacts from previous runs have stale file handles.
Pre-clean with rm -rf (which tolerates stale handles) and disable
checkout's built-in clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 14, 2026 14:27
@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 14, 2026

CodeAnt AI is reviewing your PR.


Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@qodo-code-review
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Scheduler Config

The GPU sbatch options now specify only --gres=gpu:H200:2 (and increase --ntasks-per-node) without an explicit partition/constraint. Validate that Phoenix’s Slurm configuration accepts this without needing -p/constraints, and that the new ntasks-per-node aligns with how the job scripts size MPI ranks/threads so runs don’t oversubscribe or underutilize the node.

sbatch_gpu_opts="\
#SBATCH --gres=gpu:H200:2
#SBATCH --ntasks-per-node=8       # Number of cores per node required\
"
Workspace Cleanup

The pre-checkout rm -rf workspace cleanup is aggressive and relies on glob patterns for both normal and dotfiles. Validate it behaves correctly on the self-hosted runners (including when the workspace is empty, contains unexpected dot-directories, or has permission/FS quirks) and doesn’t remove anything outside the intended workspace due to symlinks/mount behavior on NFS.

- name: Clean workspace
  run: rm -rf "$GITHUB_WORKSPACE"/* "$GITHUB_WORKSPACE"/.[!.]* 2>/dev/null || true

- name: Clone
  uses: actions/checkout@v4
  with:
    clean: false
Behavior Change

Switching read -t from fractional values to 1 second changes responsiveness/throughput characteristics of log draining and heartbeat updates. Validate the monitor loop still detects stalled output promptly and doesn’t introduce noticeable delays in status checks or job completion detection, especially in high-volume output scenarios.

while IFS= read -r -t 1 line <&3 2>/dev/null; do
  echo "$line"
  lines_read=$((lines_read + 1))
  last_heartbeat=$(date +%s)

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 14, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

Updated CI/CD and SLURM scripts: increased non-blocking tail read timeout in the SLURM monitor, replaced deprecated GPU SBATCH flags with --gres=gpu:H200:2 and increased --ntasks-per-node to 8 in submit scripts, and adjusted GitHub Actions workspace cleanup and NODE_OPTIONS handling.

Changes

Cohort / File(s) Summary
GPU Resource Configuration
.github/workflows/phoenix/submit-bench.sh, .github/workflows/phoenix/submit.sh
Removed deprecated GPU flags/partition/core directives and group flags; added --gres=gpu:H200:2 and set --ntasks-per-node=8.
SLURM Monitoring
.github/scripts/monitor_slurm_job.sh
Increased non-blocking read/poll timeout from 0.1s to 1s in the main tail-reading loop and final drain loop (polling interval only).
GitHub Actions Workflow
.github/workflows/test.yml
Removed conditional NODE_OPTIONS setting; added a top-level "Clean workspace" step (rm -rf) and set clean: false on the checkout step.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I hopped through scripts with a curious stare,
I slowed the tail's quick ear to listen to air,
Swapped old GPU trinkets for H200 gold,
Cleaned the workspace tidy, neat and bold,
Hop — the jobs purr on, carrots in my hold.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main changes: monitor fix, H200 nodes, and workspace pre-clean.
Description check ✅ Passed The description provides comprehensive coverage of all changes, testing approach, and context, though it deviates from the template structure.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into master

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codeant-ai codeant-ai bot added the size:S This PR changes 10-29 lines, ignoring generated files label Feb 14, 2026
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 4 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 14, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • GPU partition selection / scheduling
    The SBATCH GPU options now request --gres=gpu:H200:2 and increase --ntasks-per-node to 8. This may be appropriate for your cluster, but removing explicit partition hints (previously a list of GPU partitions) could make scheduling behavior different on some clusters. Confirm H200 devices exist and that removing partition hints won't cause jobs to be placed on unexpected partitions or wait longer.

  • tail PID capture reliability
    The script uses process substitution with exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1) and then captures tail_pid=$!. Depending on the shell and how process substitution is implemented, $! may not reliably be the tail PID (it can be the PID of a wrapper/subshell). If the PID isn't the tail process, attempts to kill it later may not stop the actual tail, leaving orphaned processes. Validate the PID handling across target runners and consider using a FIFO or launching tail explicitly in the background to get a reliable PID.

  • Read timeout / responsiveness
    The non-blocking read -r -t 1 loop will wait up to 1s when no input is available; this increases latency for status checks and heartbeats compared to the previous short timeout. Verify the new timeout doesn't cause delayed detection of job completion or slow heartbeats in practice (especially for short jobs). Consider whether the 1s timeout vs. the previous fractional timeout and the additional sleep 1 later in the loop create larger-than-expected polling gaps.

@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 14, 2026

CodeAnt AI finished reviewing your PR.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 202-203: The "Clean workspace" step uses rm -rf on
"$GITHUB_WORKSPACE"/* which can expand to /* if GITHUB_WORKSPACE is unset;
update that run command to guard the shell variable by using the parameter
expansion ${GITHUB_WORKSPACE:?} (i.e., replace "$GITHUB_WORKSPACE" with
"${GITHUB_WORKSPACE:?}" in the rm invocation) so the job fails fast instead of
risking deletion of root, preserving the existing redirects and || true
behavior.

This comment was marked as resolved.

- Use ${GITHUB_WORKSPACE:?} to fail fast if variable is unset
- Fix ntasks-per-node comment to say "tasks (MPI ranks)" not "cores"
- Fix monitor script comment: polling-based, not non-blocking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Feb 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.07%. Comparing base (4c52155) to head (6ae3e4c).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1146   +/-   ##
=======================================
  Coverage   44.07%   44.07%           
=======================================
  Files          70       70           
  Lines       20431    20431           
  Branches     1974     1974           
=======================================
  Hits         9004     9004           
  Misses      10291    10291           
  Partials     1136     1136           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 15, 2026

CodeAnt AI is running Incremental review


Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@codeant-ai codeant-ai bot added size:S This PR changes 10-29 lines, ignoring generated files and removed size:S This PR changes 10-29 lines, ignoring generated files labels Feb 15, 2026
@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 15, 2026

CodeAnt AI Incremental review completed.

@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 15, 2026

CodeAnt AI is running Incremental review


Thanks for using CodeAnt! 🎉

We're free for open-source projects. if you're enjoying it, help us grow by sharing.

Share on X ·
Reddit ·
LinkedIn

@codeant-ai codeant-ai bot added size:S This PR changes 10-29 lines, ignoring generated files and removed size:S This PR changes 10-29 lines, ignoring generated files labels Feb 15, 2026
@codeant-ai
Copy link
Contributor

codeant-ai bot commented Feb 15, 2026

CodeAnt AI Incremental review completed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Review effort 2/5 size:S This PR changes 10-29 lines, ignoring generated files

Development

Successfully merging this pull request may close these issues.

1 participant