fix: add CUDA pre-flight guard for plugin pipelines (fixes #675) #676
Open
livepeer-tessa wants to merge 1 commit into main from
Conversation
On fal.ai GPU-H100 workers, torch.cuda.is_available() can return True (the CUDA runtime is installed) while actual GPU access later fails with 'No CUDA GPUs are available'. This happens when CUDA_VISIBLE_DEVICES is set to an empty string or an invalid MIG UUID, or when the CUDA context has not yet been initialised and lazy init fails.

Plugin pipelines like flashvsr are disproportionately affected because their __init__ immediately allocates CUDA tensors (model loads + warmup pass), unlike built-in pipelines, which share an already-established CUDA context.

Changes:
- pipeline_manager: add _assert_cuda_accessible(), which forces lazy CUDA initialisation via a test tensor allocation, reporting device_count and CUDA_VISIBLE_DEVICES on failure. It is called before every plugin pipeline load so the error surfaces early with actionable context rather than inside the plugin's __init__.
- fal_app: log CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES at startup so future failures can be correlated with the worker environment at a glance.

Fixes #675

Signed-off-by: livepeer-robot <robot@livepeer.org>
Contributor
🚀 fal.ai Preview Deployment
Testing: Connect to this preview deployment by running this on your branch:
🧪 E2E tests will run automatically against this deployment.
Contributor
✅ E2E Tests passed
Test Artifacts: Check the workflow run for screenshots.
Problem
Issue #675: the flashvsr pipeline fails to load with 'No CUDA GPUs are available' on GPU-H100 fal.ai workers.

Root cause
`torch.cuda.is_available()` only checks that the CUDA runtime is installed; it does not actually initialise a device or allocate memory. On fal.ai H100 workers that use MIG (Multi-Instance GPU) partitioning, or where `CUDA_VISIBLE_DEVICES` is set to an empty/invalid value by the container runtime, `is_available()` returns `True` but the very first CUDA tensor allocation fails with the generic 'No CUDA GPUs are available' error.

Plugin pipelines like `flashvsr` are disproportionately affected because their `__init__` immediately allocates CUDA tensors (model loads + a full warmup pass). Built-in pipelines share a CUDA context that is already established by the time they initialise, so they never hit the race. The cryptic error message and the models-dir hint (consider removing …) in the logged output obscured the true cause.

Fix
src/scope/server/pipeline_manager.py
- Added `_assert_cuda_accessible()`: forces lazy CUDA initialisation via a tiny test tensor allocation before handing off to the plugin.
- On failure, reports `device_count` and `CUDA_VISIBLE_DEVICES` in the exception so the problem is immediately diagnosable from logs.
- Called before every plugin pipeline load (the `if pipeline_class is not None and pipeline_id not in BUILTIN_PIPELINES` branch).

src/scope/cloud/fal_app.py
- Log `CUDA_VISIBLE_DEVICES` and `NVIDIA_VISIBLE_DEVICES` at startup alongside the existing `nvidia-smi` output, so future failures can be correlated with the worker environment without needing to reproduce.

Testing
All existing tests pass:
- tests/test_plugin_manager.py: 99 passed
- tests/test_pipeline_manager_vace.py: 32 passed

Notes for reviewers
The `_assert_cuda_accessible()` guard is intentionally only called for plugin pipelines, not built-in ones. Built-ins have their own `torch.device("cuda")` checks, and calling the guard there would be redundant. If we want to extend it to built-ins in the future, the function is ready to be dropped in.
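For clarity, the plugin-only gating described above can be sketched as follows. `BUILTIN_PIPELINES` and `pipeline_class` are names taken from the PR description; everything else (the placeholder set contents, the helper names) is hypothetical illustration, not the actual diff:

```python
# Record guard invocations so the gating logic below is observable.
guard_calls = []


def _assert_cuda_accessible():
    # Stand-in for the real guard, which would force a tiny test CUDA
    # allocation and raise with device_count / CUDA_VISIBLE_DEVICES context.
    guard_calls.append("preflight")


BUILTIN_PIPELINES = {"builtin_example"}  # hypothetical placeholder set


def maybe_run_preflight(pipeline_id, pipeline_class):
    """Mirror the PR's condition: run the guard only for plugin loads.

    Built-in pipelines are skipped because they perform their own
    torch.device("cuda") checks during initialisation.
    """
    if pipeline_class is not None and pipeline_id not in BUILTIN_PIPELINES:
        _assert_cuda_accessible()
```

The effect is that a plugin such as `flashvsr` triggers the pre-flight check, while built-in pipelines and unresolved (None) classes pass through untouched.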