CI robustness: monitor fix, H200 nodes, workspace pre-clean #1146
sbryngelson wants to merge 8 commits into MFlowCode:master
read -t 0.1 (sub-second timeout) in a loop with process substitution file descriptors triggers a bash internal error (unwind_frame_run: read_builtin: frame not found) leading to a segfault. Use integer timeout (read -t 1) instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
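The polling pattern this commit describes can be sketched roughly as follows. This is a minimal illustration, not the actual contents of monitor_slurm_job.sh: the function name `stream_log`, its arguments, and the use of file descriptor 3 are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: stream a growing log file through a
# process-substitution file descriptor, polling with an integer
# read timeout (read -t 1) instead of a fractional one (read -t 0.1).

stream_log() {
    local logfile="$1"   # file to follow (assumed name)
    local max_idle="$2"  # stop after this many consecutive idle polls
    local idle=0 line tail_pid

    # Open tail's output on a dedicated file descriptor.
    exec 3< <(tail -n +1 -f "$logfile" 2>/dev/null)
    tail_pid=$!

    while (( idle < max_idle )); do
        # Wait up to one whole second for the next line.
        if IFS= read -r -t 1 line <&3; then
            printf '%s\n' "$line"
            idle=0
        else
            idle=$(( idle + 1 ))
        fi
    done

    # Close the descriptor and stop the background tail.
    exec 3<&-
    kill "$tail_pid" 2>/dev/null || true
}
```

Calling `stream_log job.out 5` would print lines as they appear and return after roughly five seconds without new output. The coarser timeout trades a little latency for staying off the sub-second code path named in the commit message.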
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phoenix runners work fine without the max-old-space-size override. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ners The actions/checkout clean step fails with ESTALE errors on NFS-backed storage when build artifacts from previous runs have stale file handles. Pre-clean with rm -rf (which tolerates stale handles) and disable checkout's built-in clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Updated CI/CD and SLURM scripts: increased non-blocking tail read timeout in the SLURM monitor, replaced deprecated GPU SBATCH flags with …
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 202-203: The "Clean workspace" step uses rm -rf on
"$GITHUB_WORKSPACE"/* which can expand to /* if GITHUB_WORKSPACE is unset;
update that run command to guard the shell variable by using the parameter
expansion ${GITHUB_WORKSPACE:?} (i.e., replace "$GITHUB_WORKSPACE" with
"${GITHUB_WORKSPACE:?}" in the rm invocation) so the job fails fast instead of
risking deletion of root, preserving the existing redirects and || true
behavior.
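The effect of the suggested guard can be seen in a small sketch. The function `pre_clean` and the variable `WORKDIR` are hypothetical stand-ins for the workflow step and `GITHUB_WORKSPACE`; this illustrates the parameter expansion only and is not the workflow's actual step.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the "Clean workspace" step.
# The body runs in a subshell so that a failed expansion aborts only
# this step, mirroring how a failed `run:` command fails the job.
pre_clean() (
    # ${WORKDIR:?} stops the shell with a non-zero status when WORKDIR
    # is unset or empty, so the glob can never expand to /*.
    # The expansion error is raised before rm runs, so `|| true`
    # does not mask it.
    rm -rf "${WORKDIR:?}"/* 2>/dev/null || true
)
```

With `WORKDIR` pointing at a directory, the call empties it and succeeds even if `rm` reports errors. With `WORKDIR` unset, the call fails before `rm` is ever invoked.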
- Use ${GITHUB_WORKSPACE:?} to fail fast if variable is unset
- Fix ntasks-per-node comment to say "tasks (MPI ranks)" not "cores"
- Fix monitor script comment: polling-based, not non-blocking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1146   +/-   ##
=======================================
  Coverage   44.07%   44.07%
=======================================
  Files          70       70
  Lines       20431    20431
  Branches     1974     1974
=======================================
  Hits         9004     9004
  Misses      10291    10291
  Partials     1136     1136
User description
Summary
- Fix the bash segfault in monitor_slurm_job.sh caused by fractional read timeout values
- Remove the NODE_OPTIONS max-old-space-size override from the CI workflow
Test plan
Split from #1130.
🤖 Generated with Claude Code
Summary by CodeRabbit
CodeAnt-AI Description
Fix CI monitor crash, use H200 GPUs for Phoenix jobs, and avoid NFS ESTALE on self-hosted runners
What Changed
Impact
✅ Fewer CI monitor crashes during job streaming
✅ Faster scheduling of Phoenix GPU jobs
✅ Fewer ESTALE failures on self-hosted runners