fix: normalize kv cache indices to Python ints in krea_realtime_video #682
livepeer-tessa wants to merge 2 commits into main
Conversation
On fal.ai GPU-H100 workers `torch.cuda.is_available()` can return True (the CUDA runtime is installed) while actual GPU access later fails with 'No CUDA GPUs are available'. This happens when CUDA_VISIBLE_DEVICES is set to an empty string or an invalid MIG UUID, or when the CUDA context has not yet been initialised and lazy init fails. Plugin pipelines like flashvsr are disproportionately affected because their `__init__` immediately allocates CUDA tensors (model loads + warmup pass), unlike built-in pipelines that share an already-established CUDA context.

Changes:
- pipeline_manager: add `_assert_cuda_accessible()` that forces lazy CUDA initialisation via a test tensor allocation, reporting device_count and CUDA_VISIBLE_DEVICES on failure. Called before every plugin pipeline load so the error surfaces early with actionable context rather than inside the plugin's `__init__`.
- fal_app: log CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES at startup so future failures can be correlated with the worker environment at a glance.

Fixes #675

Signed-off-by: livepeer-robot <robot@livepeer.org>
Fixes #679

After a cache reset, `initialize_kv_cache()` stores `torch.tensor([0], dtype=torch.long)` in the `global_end_index` and `local_end_index` slots of the KV cache. The krea_realtime_video `CausalWanSelfAttention` forward pass read these values directly without converting to Python ints, causing all subsequent arithmetic (`local_end_index`, `cache_current_block_start`, etc.) to also produce tensors.

When `cache_current_block_start` was captured in `score_mod` as a closure variable and passed to `torch.compile(flex_attention, dynamic=False)`, flex_attention attempted to re-trace `score_mod` on every chunk because the captured tensor *object* identity changed each call. The FX tracer then hit the already-compiled flex_attention, triggering: 'Detected that you are using FX to symbolically trace a dynamo-optimized function. This is not supported at the moment.' The `_dispatch_keys` TypeError followed from FakeTensors (used during tracing) colliding with real CUDA tensors captured in the closure.

Fix: extract `cache_global_end` and `cache_local_end` as Python ints using `int()` at the top of the caching block. `int()` safely handles both Python ints (already in the cache after the first chunk) and single-element torch.Tensors (present on the first chunk after a cache reset). Also replace the tensor-based `score_mod` constants (`frame_seqlen_tensor`, `cache_current_block_start_tensor`, `log_scale_tensor`) with Python scalar literals (`_fs`, `_ccbs`, `_ls`) that become stable graph constants, avoiding both the FX re-trace and the `_dispatch_keys` collision. Other pipelines (longlive, memflow, streamdiffusionv2) already use `.item()` for cache index reads for the same reason.

Signed-off-by: Tessa <tessa@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
✅ E2E Tests passed
Summary
Fixes #679
Two related errors were appearing during chunk processing:

- FX symbolic trace of a dynamo-optimized function
- `_dispatch_keys` TypeError: incompatible function arguments

Root Cause
After a cache reset, `initialize_kv_cache()` stores `torch.tensor([0], dtype=torch.long)` in the `global_end_index` and `local_end_index` slots. The krea_realtime_video `CausalWanSelfAttention` read these values without converting to Python ints, unlike every other pipeline (`longlive`, `memflow`, `streamdiffusionv2` all use `.item()`).

This caused all downstream arithmetic to produce tensors:
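A minimal repro of that propagation, using plain CPU tensors and a hypothetical dict-shaped cache (the real cache lives on CUDA inside the attention module):

```python
import torch

# Post-reset state: the cache slot holds a 1-element long tensor, as
# initialize_kv_cache() stores it.
kv_cache = {"global_end_index": torch.tensor([0], dtype=torch.long)}

# Reading it directly keeps every downstream value a tensor:
end = kv_cache["global_end_index"]
block_start = end + 16                     # still a torch.Tensor, not an int
print(type(block_start))                   # <class 'torch.Tensor'>

# Normalizing with int() yields a stable Python scalar instead:
end_i = int(kv_cache["global_end_index"])
print(type(end_i + 16))                    # <class 'int'>
```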
When `cache_current_block_start` (a tensor) was captured in `score_mod` and passed to `torch.compile(flex_attention, dynamic=False)`, flex_attention re-traced `score_mod` on every chunk because the captured tensor's object identity changed. The FX tracer then hit the already-compiled flex_attention, triggering the "FX symbolic trace" error. The `_dispatch_keys` TypeError followed from FakeTensors colliding with real CUDA tensors during that re-trace.

Fix
Two changes to `CausalWanSelfAttention.forward()`:

1. Normalize cache indices to Python ints at the start of the caching block:
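A sketch of that normalization, assuming a dict-like cache as described in the commit message (the index values here are illustrative):

```python
import torch

# A freshly reset cache holds 1-element tensors; a warm cache holds ints.
kv_cache_reset = {"global_end_index": torch.tensor([0], dtype=torch.long),
                  "local_end_index": torch.tensor([0], dtype=torch.long)}
kv_cache_warm = {"global_end_index": 1560, "local_end_index": 128}

def read_indices(cache):
    # int() accepts both a Python int and a single-element tensor, so the
    # caching block works with plain scalars from here on in either state.
    cache_global_end = int(cache["global_end_index"])
    cache_local_end = int(cache["local_end_index"])
    return cache_global_end, cache_local_end

print(read_indices(kv_cache_reset))  # (0, 0)
print(read_indices(kv_cache_warm))   # (1560, 128)
```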
`int()` safely handles both Python ints (after the first chunk) and single-element tensors (first chunk after a cache reset).

2. Use Python scalar literals in `score_mod` instead of freshly-created CUDA tensors: Python scalars become stable graph constants, since
`torch.compile` captures them once and never re-traces, regardless of their value changing between chunks.

Testing
The fix can be verified by:

- Running `krea-realtime-video` on a GPU worker
- Confirming there are no FX trace or `_dispatch_keys` errors in the logs
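For change 2 above, the scalar-constant pattern might look like the sketch below. The names `_fs`, `_ccbs`, `_ls` come from the PR description, but the values and the `score_mod` body are illustrative, not the pipeline's actual attention math:

```python
import math
import torch

# Plain Python scalars: torch.compile bakes these into the compiled graph
# as constants, so closure identity is stable across chunks.
_fs = 1560            # stand-in for the former frame_seqlen_tensor
_ccbs = 0             # stand-in for cache_current_block_start_tensor
_ls = math.log(64.0)  # stand-in for log_scale_tensor

def score_mod(score, b, h, q_idx, kv_idx):
    # Capturing freshly-created CUDA tensors here instead would change the
    # closure's object identity on every chunk and force flex_attention
    # to re-trace score_mod each call.
    in_cached_block = kv_idx < (_ccbs + _fs)
    return torch.where(in_cached_block, score * _ls, score)

# score_mod follows the (score, batch, head, q_idx, kv_idx) signature that
# torch.nn.attention.flex_attention expects for score modifiers.
out = score_mod(torch.ones(4), 0, 0, torch.tensor(0), torch.arange(4))
```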