fix: add CUDA pre-flight guard for plugin pipelines (fixes #675) #676
Open
livepeer-tessa wants to merge 1 commit into main from
Conversation
On fal.ai GPU-H100 workers, torch.cuda.is_available() can return True (the CUDA runtime is installed) while actual GPU access later fails with 'No CUDA GPUs are available'. This happens when CUDA_VISIBLE_DEVICES is set to an empty string or an invalid MIG UUID, or when the CUDA context has not yet been initialised and lazy init fails.

Plugin pipelines like flashvsr are disproportionately affected because their __init__ immediately allocates CUDA tensors (model loads + warmup pass), unlike built-in pipelines, which share an already-established CUDA context.

Changes:
- pipeline_manager: add _assert_cuda_accessible(), which forces lazy CUDA initialisation via a test tensor allocation, reporting device_count and CUDA_VISIBLE_DEVICES on failure. It is called before every plugin pipeline load so the error surfaces early with actionable context rather than inside the plugin's __init__.
- fal_app: log CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES at startup so future failures can be correlated with the worker environment at a glance.

Fixes #675

Signed-off-by: livepeer-robot <robot@livepeer.org>
Contributor
🚀 fal.ai Preview Deployment
Testing: Connect to this preview deployment by running this on your branch:
🧪 E2E tests will run automatically against this deployment.
Contributor
✅ E2E Tests passed
Test Artifacts: Check the workflow run for screenshots.
Problem
Issue #675: the flashvsr pipeline fails to load with 'No CUDA GPUs are available' on GPU-H100 fal.ai workers.

Root cause
`torch.cuda.is_available()` only checks that the CUDA runtime is installed; it does not actually initialise a device or allocate memory. On fal.ai H100 workers that use MIG (Multi-Instance GPU) partitioning, or where `CUDA_VISIBLE_DEVICES` is set to an empty/invalid value by the container runtime, `is_available()` returns `True` but the very first CUDA tensor allocation fails with the generic 'No CUDA GPUs are available' error.

Plugin pipelines like `flashvsr` are disproportionately affected because their `__init__` immediately allocates CUDA tensors (model loads + a full warmup pass). Built-in pipelines share a CUDA context that is already established by the time they initialise, so they never hit the race. The cryptic error message and the models-dir hint (consider removing …) in the logged output obscured the true cause.

Fix
src/scope/server/pipeline_manager.py
- Added `_assert_cuda_accessible()`: forces lazy CUDA initialisation via a tiny test tensor allocation before handing off to the plugin.
- On failure, reports `device_count` and `CUDA_VISIBLE_DEVICES` in the exception so the problem is immediately diagnosable from logs.
- Called before every plugin pipeline load (the `if pipeline_class is not None and pipeline_id not in BUILTIN_PIPELINES` branch).

src/scope/cloud/fal_app.py
- Log `CUDA_VISIBLE_DEVICES` and `NVIDIA_VISIBLE_DEVICES` at startup alongside the existing `nvidia-smi` output, so future failures can be correlated with the worker environment without needing to reproduce.

Testing
All existing tests pass:
- tests/test_plugin_manager.py: 99 passed
- tests/test_pipeline_manager_vace.py: 32 passed

Notes for reviewers
The `_assert_cuda_accessible()` guard is intentionally only called for plugin pipelines, not built-in ones. Built-ins have their own `torch.device("cuda")` checks, and calling the guard there would be redundant. If we want to extend it to built-ins in the future, the function is ready to be dropped in.
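For clarity, the plugin-only gating described above can be sketched as follows. `BUILTIN_PIPELINES` and `pipeline_class` are names taken from the PR description; everything else (the placeholder set contents, the helper names) is hypothetical illustration, not the actual diff:

```python
# Record guard invocations so the gating logic below is observable.
guard_calls = []


def _assert_cuda_accessible():
    # Stand-in for the real guard, which would force a tiny test CUDA
    # allocation and raise with device_count / CUDA_VISIBLE_DEVICES context.
    guard_calls.append("preflight")


BUILTIN_PIPELINES = {"builtin_example"}  # hypothetical placeholder set


def maybe_run_preflight(pipeline_id, pipeline_class):
    """Mirror the PR's condition: run the guard only for plugin loads.

    Built-in pipelines are skipped because they perform their own
    torch.device("cuda") checks during initialisation.
    """
    if pipeline_class is not None and pipeline_id not in BUILTIN_PIPELINES:
        _assert_cuda_accessible()
```

The effect is that a plugin such as `flashvsr` triggers the pre-flight check, while built-in pipelines and unresolved (None) classes pass through untouched.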