fix: krea FP8 + torch.compile incompatibility on H100 (#669) #670
livepeer-tessa wants to merge 1 commit into main
Conversation
`Float8DynamicActivationFloat8WeightConfig` is not compatible with `torch.compile(fullgraph=False)`. During warmup on H100 (where `compile=True`), AOT autograd's `gen_alias_from_base` calls `aten.as_strided` on `Float8Tensor` outputs, which is not implemented in torchao:

```
NotImplementedError: Float8Tensor dispatch: attempting to run unimplemented operator/function: func=<OpOverload(op='aten.as_strided', overload='default')>
```

The crash manifests specifically after longlive (also FP8) because `torch._dynamo`'s compile cache is never reset between pipeline switches, allowing longlive's Float8 dispatch state to persist and influence Krea's subsequent compile attempt.

Two fixes:

1. `krea_realtime_video/pipeline.py`: when FP8 quantization is active, skip `block.compile()` — the two optimizations are currently mutually exclusive with `fullgraph=False`. FP8 alone still provides meaningful memory/compute savings on H100 without compile.
2. `pipeline_manager.py`: call `torch._dynamo.reset()` on every pipeline unload to clear stale compiled graphs and Float8 dispatch state, preventing cross-pipeline cache pollution.

Fixes #669

Signed-off-by: livepeer-robot <robot@livepeer.org>
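The first fix can be sketched as a simple guard at the point where blocks are compiled. This is a minimal illustration, not the repository's actual code; `maybe_compile_blocks`, `blocks`, and `fp8_enabled` are hypothetical names:

```python
import logging

logger = logging.getLogger(__name__)

def maybe_compile_blocks(blocks, fp8_enabled: bool) -> bool:
    """Compile transformer blocks unless FP8 quantization is active.

    Float8DynamicActivationFloat8WeightConfig combined with
    torch.compile(fullgraph=False) crashes during warmup (aten.as_strided is
    unimplemented for Float8Tensor), so the two optimizations are treated as
    mutually exclusive here. Returns True if compilation was applied.
    """
    if fp8_enabled:
        logger.info("FP8 quantization active; skipping block.compile()")
        return False
    for block in blocks:
        block.compile()  # wraps the block's forward with torch.compile
    return True
```

FP8 still applies on its own; only the compile step is skipped, so the memory/compute savings from quantization are retained.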
📝 Walkthrough
The changes implement safeguards against FP8 quantization incompatibilities by preventing block-wide compilation of attention modules when FP8 is active, and by clearing the PyTorch Dynamo compilation cache upon pipeline unload to prevent stale compiled graphs from persisting across loads.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (warnings)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/scope/core/pipelines/krea_realtime_video/pipeline.py (1)
141-160: Add a regression test for the LongLive FP8 → Krea reload path. This guard only works in combination with the unload-side reset in src/scope/server/pipeline_manager.py, and the failure shows up only after that specific unload/reload sequence. Without an automated regression, either half of the fix can drift and the crash will only reappear during H100 warmup.
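Such a regression could be structured along these lines. `PipelineManager` below is a minimal stand-in defined inline so the sketch is self-contained; the real test would import the actual manager from `src/scope/server/pipeline_manager.py` and monkeypatch `torch._dynamo.reset`, and all names here are assumptions:

```python
from unittest import mock

class PipelineManager:
    """Stand-in for the real manager, just enough to show the test shape."""

    def __init__(self, dynamo_reset):
        self._dynamo_reset = dynamo_reset  # injected stand-in for torch._dynamo.reset
        self.loaded = None

    def load(self, name):
        self.loaded = name

    def unload(self):
        self._dynamo_reset()  # must run on *every* unload
        self.loaded = None

def test_dynamo_reset_called_between_pipeline_switches():
    reset = mock.Mock()
    manager = PipelineManager(dynamo_reset=reset)
    manager.load("longlive")             # FP8, no compile
    manager.unload()
    manager.load("krea-realtime-video")  # FP8 + compile on H100
    # The unload between the two loads must have flushed Dynamo state.
    reset.assert_called_once()
```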
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/server/pipeline_manager.py`:
- Around line 698-702: The try/except around torch._dynamo.reset() currently
swallows errors and the code still publishes pipeline_unloaded; change the flow
so that if torch._dynamo.reset() raises you do NOT publish pipeline_unloaded and
instead either raise/return early from the unload path or set a persistent
worker-level flag (e.g., worker_force_no_compile = True) that subsequent load
logic will read to force compile=False; update the code paths that publish
pipeline_unloaded and the load logic that checks compile (the logic using
compile=any(... for x in ("h100","hopper"))) to respect this flag so stale
Dynamo/FP8 state cannot cause re-compilation attempts.
📒 Files selected for processing (2)
- src/scope/core/pipelines/krea_realtime_video/pipeline.py
- src/scope/server/pipeline_manager.py
```python
try:
    torch._dynamo.reset()
    logger.info("torch._dynamo cache reset")
except Exception as e:
    logger.warning(f"torch._dynamo reset failed: {e}")
```
🧩 Analysis chain
🌐 Web query:
In PyTorch 2.9.1, if `torch._dynamo.reset()` is skipped or raises, do previously cached `torch.compile` graphs remain active for later compiles in the same worker process?
💡 Result:
Yes. In PyTorch 2.9.1, TorchDynamo stores torch.compile results in in-memory caches keyed off Python code objects (compiled bytecode + guards). Those cache entries persist for the lifetime of the worker process unless you explicitly clear them. [2], [3]
torch._dynamo.reset() is the API that clears those compile caches and restores Dynamo’s state as if you started a fresh process (it calls reset_code_caches() and clears multiple global Dynamo structures). If you skip it—or it raises before completing—then whatever caches were already created remain and can still be reused by later torch.compile calls in the same process (when the same underlying __code__ objects are encountered and guards match). [1], [2]
Sources:
[1] PyTorch v2.9.1 torch._dynamo.reset() implementation (torch/_dynamo/__init__.py)
[2] PyTorch 2.9 docs: Dynamo cache entries are stored/retrieved from a function’s __code__ object (_debug_get_cache_entry_list)
[3] PyTorch dev-discuss: each traced code object has its own cache entries (compiled bytecode + guards)
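The persistence described above can be modeled conceptually in plain Python. This is only an analogy of Dynamo's per-code-object cache, not its actual implementation:

```python
# Conceptual model: Dynamo keys compiled artifacts off a function's __code__
# object, so entries live for the whole process unless explicitly cleared.
_compile_cache = {}

def fake_compile(fn):
    """Return a cached 'compiled' wrapper keyed by fn.__code__."""
    key = fn.__code__
    if key not in _compile_cache:
        _compile_cache[key] = lambda *args, **kwargs: fn(*args, **kwargs)
    return _compile_cache[key]

def fake_reset():
    """Analog of torch._dynamo.reset(): drop every cached entry."""
    _compile_cache.clear()
```

Calling `fake_compile` twice on the same function returns the same cached wrapper; only `fake_reset()` starts from a clean slate, which mirrors why skipping the reset leaves stale entries visible to later compiles.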
🏁 Scripts executed (repository: daydreamlive/scope):

```shell
# Check the implementation of _unload_pipeline_by_id_unsafe around lines 698-702
sed -n '680,710p' src/scope/server/pipeline_manager.py

# Check if there's any state tracking for the reset failure or if compile is disabled conditionally
rg "compile\s*=" src/scope/server/pipeline_manager.py | head -20

# Also check what happens next when a pipeline is loaded - does it check any worker state?
rg "_load_pipeline" src/scope/server/pipeline_manager.py -A 5 | head -40

# Check if there's a compile flag or state that gets disabled on reset failure
rg "torch\.compile|compile_kwargs|torch_compile" src/scope/server/pipeline_manager.py | head -20

# Get more context around the compile decision
sed -n '1,50p' src/scope/server/pipeline_manager.py

# Search for where compile is determined during pipeline load
rg "compile" src/scope/server/pipeline_manager.py -B 3 -A 3 | head -60

# Check if there's any state field tracking reset failure or worker health
rg "self\._.*reset|self\._.*failed|self\._.*error" src/scope/server/pipeline_manager.py | head -20

# Check if _load_pipeline_implementation has any fallback or state checking
sed -n '300,400p' src/scope/server/pipeline_manager.py | grep -A 20 "_load_pipeline_implementation"

# Look for the full _load_pipeline_implementation method
rg "_load_pipeline_implementation" src/scope/server/pipeline_manager.py -A 30 | head -50

# Check if there's any exception handling around the compile call or pipeline init
rg "KreaRealtimeVideoPipeline\|compile=" src/scope/server/pipeline_manager.py -B 5 -A 5

# Search for where KreaRealtimeVideoPipeline is instantiated
rg "KreaRealtimeVideoPipeline" src/scope/server/pipeline_manager.py -B 2 -A 10

# Look for the full context around compile decision
sed -n '250,350p' src/scope/server/pipeline_manager.py
```
Don't treat a Dynamo reset failure as a successful unload.

This fix depends on clearing the stale FP8/compile state here. If `torch._dynamo.reset()` raises, the code still reports `pipeline_unloaded` unconditionally, and the next krea-realtime-video load will attempt to compile with the same GPU-based decision (`compile=any(... for x in ("h100", "hopper"))`) while stale compile caches remain active in the worker process. This can re-enter the warmup crash path from the original FP8-to-Krea conflict.

Either fail the transition (prevent `pipeline_unloaded` publication on reset failure) or latch a worker-level flag that forces `compile=False` on all subsequent loads until the worker restarts.
✅ E2E Tests passed
Test Artifacts: check the workflow run for screenshots.
…ale-cache recompile

If torch._dynamo.reset() raises during pipeline unload, stale Dynamo/FP8 compile caches remain active in the worker process. Previously the code swallowed the exception and published pipeline_unloaded unconditionally, leaving the next krea-realtime-video load free to torch.compile against those stale caches — re-entering the warmup crash from the FP8→Krea conflict.

Fix: set self._dynamo_reset_failed = True on reset failure. The Krea load path now checks this flag and forces compile=False for the lifetime of the worker, with a clear log warning to restart the process to re-enable compilation.

Addresses CodeRabbit review comment on PR #670.

Signed-off-by: livepeer-robot <robot@livepeer.org>
livepeer-tessa left a comment
Good catch @coderabbitai. Filed #671 to address this.
The fix goes with the worker-level flag approach: self._dynamo_reset_failed is set to True if torch._dynamo.reset() raises. Subsequent krea-realtime-video loads read it and force compile=False for the rest of the process lifetime, with a warning to restart. The unload itself still completes and pipeline_unloaded is still published (memory is freed — we just can't guarantee the Dynamo cache is clean).
Failing the unload entirely felt like the worse option since it'd leave the pipeline in a limbo state and block future load attempts.
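The latch described above can be sketched as follows. The class, attribute, and method names are assumptions for illustration, not the repository's actual code; the real implementation lives in pipeline_manager.py:

```python
import logging

logger = logging.getLogger(__name__)

class Worker:
    """Sketch: latch compile-off for the process lifetime after a failed reset."""

    def __init__(self, dynamo_reset):
        self._dynamo_reset = dynamo_reset  # stand-in for torch._dynamo.reset
        self._dynamo_reset_failed = False

    def unload(self):
        try:
            self._dynamo_reset()
        except Exception as e:
            # Unload still completes (memory is freed), but the Dynamo cache
            # may be dirty, so remember that for all subsequent loads.
            self._dynamo_reset_failed = True
            logger.warning(
                "torch._dynamo reset failed; forcing compile=False until "
                "the worker restarts: %s", e,
            )
        return "pipeline_unloaded"  # still published either way

    def compile_allowed(self, gpu_name: str) -> bool:
        if self._dynamo_reset_failed:
            return False  # restart the worker to re-enable compilation
        return any(x in gpu_name.lower() for x in ("h100", "hopper"))
```

The key trade-off is visible in `unload()`: the transition always completes, so the pipeline never ends up in a limbo state, at the cost of running without compile until a restart.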
Problem

Issue #669 — Krea fails to load on remote inference after using longlive.

Root cause

On H100/Hopper, Krea loads with `compile=True`. If FP8 quantization is also active, `Float8DynamicActivationFloat8WeightConfig` + `torch.compile(fullgraph=False)` crash during warmup.

Why it happens: with `fullgraph=False`, graph breaks split the compiled function into sub-graphs. At the boundary between sub-graphs, AOT autograd's `gen_alias_from_base` calls `aten.as_strided` on the `Float8Tensor` output of a compiled linear layer to create an output alias — but `as_strided` is not implemented for `Float8Tensor` in torchao.

Why specifically after longlive: `torch._dynamo.reset()` is never called between pipeline switches. After longlive runs with FP8 (no compile), its Float8Tensor dispatch state persists in the global dynamo compilation cache. When Krea then loads with FP8 + compile, the residual cache state triggers the AOT autograd aliasing code path that would otherwise be avoided on a fresh worker.

Traceback path

Fix

1. `krea_realtime_video/pipeline.py` — when FP8 quantization is active, skip `block.compile()`. The two optimisations are currently mutually exclusive under `fullgraph=False`. FP8 alone still provides meaningful memory/compute savings on H100.
2. `pipeline_manager.py` — call `torch._dynamo.reset()` on every pipeline unload to flush stale compiled graphs and Float8 dispatch state, preventing cross-pipeline cache pollution.

Testing

Reproduce with:

Closes #669