
fix: add CUDA pre-flight guard for plugin pipelines (fixes #675) #676

Open
livepeer-tessa wants to merge 1 commit into main from tessa/fix/flashvsr-cuda-guard-675

Conversation

@livepeer-tessa
Contributor

Problem

Issue #675 — flashvsr pipeline fails to load with 'No CUDA GPUs are available' on GPU-H100 fal.ai workers.

Root cause

torch.cuda.is_available() only checks that the CUDA runtime is installed — it does not actually initialise a device or allocate memory. On fal.ai H100 workers that use MIG (Multi-Instance GPU) partitioning, or where CUDA_VISIBLE_DEVICES is set to an empty/invalid value by the container runtime, is_available() returns True but the very first CUDA tensor allocation fails with the generic PyTorch error:

RuntimeError: No CUDA GPUs are available

Plugin pipelines like flashvsr are disproportionately affected because their __init__ immediately allocates CUDA tensors (model loads + a full warmup pass). Built-in pipelines share a CUDA context that is already established by the time they initialise, so they never hit the race. The cryptic error message and the models-dir hint (consider removing …) in the logged output obscured the true cause.

Fix

src/scope/server/pipeline_manager.py

  • Added _assert_cuda_accessible() — forces lazy CUDA initialisation via a tiny test tensor allocation before handing off to the plugin.
  • Reports device_count and CUDA_VISIBLE_DEVICES in the exception so the problem is immediately diagnosable from logs.
  • Wired into the plugin loading path (the if pipeline_class is not None and pipeline_id not in BUILTIN_PIPELINES branch).

src/scope/cloud/fal_app.py

  • Logs CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES at startup alongside the existing nvidia-smi output, so future failures can be correlated with the worker environment without needing to reproduce.
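The startup logging amounts to something like the sketch below; the function name and logger name are illustrative, and in `fal_app.py` this sits alongside the existing nvidia-smi output:

```python
import logging
import os

logger = logging.getLogger("scope.fal_app")  # illustrative logger name


def log_gpu_environment() -> None:
    """Log device-visibility env vars at startup so a later CUDA failure
    can be correlated with the worker environment from the logs alone."""
    for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
        logger.info("%s=%r", var, os.environ.get(var, "<unset>"))
```

Logging `%r` (the repr) makes an empty string show up as `''` rather than vanishing, which matters for exactly the failure mode this PR targets.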

Testing

All existing tests pass:

  • tests/test_plugin_manager.py — 99 passed
  • tests/test_pipeline_manager_vace.py — 32 passed

Notes for reviewers

The _assert_cuda_accessible() guard is intentionally called only for plugin pipelines, not built-in ones. Built-ins have their own torch.device("cuda") checks, and calling the guard there would be redundant. If we want to extend it to built-ins in the future, the function is ready to be dropped in.
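For context, the wiring into the plugin loading branch amounts to something like this sketch (the `BUILTIN_PIPELINES` contents and the `load_pipeline` signature are placeholders, and the guard body is stubbed out):

```python
BUILTIN_PIPELINES = {"builtin_example"}  # placeholder, not the real registry


def _assert_cuda_accessible() -> None:
    # Stub for the guard added in pipeline_manager.py; the real version
    # allocates a tiny CUDA tensor and reports device_count and
    # CUDA_VISIBLE_DEVICES on failure.
    pass


def load_pipeline(pipeline_id: str, pipeline_class):
    # Only plugin pipelines run the pre-flight guard; built-ins keep their
    # own torch.device("cuda") checks and an already-established context.
    if pipeline_class is not None and pipeline_id not in BUILTIN_PIPELINES:
        _assert_cuda_accessible()
    return pipeline_class() if pipeline_class is not None else None
```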

On fal.ai GPU-H100 workers torch.cuda.is_available() can return True
(CUDA runtime is installed) while actual GPU access later fails with
'No CUDA GPUs are available'.  This happens when CUDA_VISIBLE_DEVICES
is set to an empty string or an invalid MIG UUID, or when the CUDA
context has not yet been initialised and lazy init fails.

Plugin pipelines like flashvsr are disproportionately affected because
their __init__ immediately allocates CUDA tensors (model loads + warmup
pass), unlike built-in pipelines that share an already-established CUDA
context.

Changes:
- pipeline_manager: add _assert_cuda_accessible() that forces lazy CUDA
  initialisation via a test tensor allocation, reporting device_count
  and CUDA_VISIBLE_DEVICES on failure.  Called before every plugin
  pipeline load so the error surfaces early with actionable context
  rather than inside the plugin's __init__.
- fal_app: log CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES at
  startup so future failures can be correlated with the worker
  environment at a glance.

Fixes #675

Signed-off-by: livepeer-robot <robot@livepeer.org>
@coderabbitai
coderabbitai bot commented Mar 13, 2026

Review skipped: auto reviews are disabled on this repository. To trigger a single review, invoke the @coderabbitai review command.

@github-actions
Contributor

🚀 fal.ai Preview Deployment

App ID: daydream/scope-pr-676--preview
WebSocket: wss://fal.run/daydream/scope-pr-676--preview/ws
Commit: 2bb4782

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-676--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

✅ E2E Tests passed

Status: passed
fal App: daydream/scope-pr-676--preview

Test Artifacts

Check the workflow run for screenshots.
