
fix: pad vace_input_frames to min temporal size to prevent 3x1x1 kernel underflow#674

Open
livepeer-tessa wants to merge 6 commits into main from
fix/vace-temporal-kernel-underflow

Conversation

@livepeer-tessa
Contributor

Problem

Fixes #673.

VaceEncodingBlock._encode_with_conditioning hard-crashes with a PyTorch convolution error when the input chunk has fewer than num_frame_per_block × vae_temporal_downsample_factor (= 12) frames:

RuntimeError: Calculated padded input size per channel: (2 x 64 x 64).
Kernel size: (3 x 1 x 1). Kernel size can't be greater than actual input size

The WAN VAE encoder contains a 3×1×1 temporal convolution kernel. After temporal downsampling, 8 pixel-space frames become 2 latent-space frames, which is below the kernel's minimum temporal size of 3.

Observed in prod on 2026-03-12 for the streamdiffusionv2 pipeline with VACE conditioning.

Solution

In _encode_with_conditioning, detect when num_frames < min_frames and pad to min_frames by repeating the last input frame (and last mask frame when vace_input_masks is also supplied). A WARNING is logged for observability without crashing.
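A minimal sketch of this guard, assuming `[B, C, T, H, W]` layout for both frames and masks (the standalone helper name and signature are illustrative; the real check lives inside `_encode_with_conditioning`):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def pad_temporal(frames, masks=None, min_frames=12):
    """Pad [B, C, T, H, W] tensors to at least min_frames frames by
    repeating the last frame along the temporal axis (sketch only)."""
    t = frames.shape[2]
    if t >= min_frames:
        return frames, masks
    pad = min_frames - t
    # WARNING keeps short chunks visible in logs without crashing
    logger.warning("vace_input_frames has %d frames; padding to %d", t, min_frames)
    last = frames[:, :, -1:].expand(-1, -1, pad, -1, -1)
    frames = torch.cat([frames, last], dim=2)
    if masks is not None:
        last_mask = masks[:, :, -1:].expand(-1, -1, pad, -1, -1)
        masks = torch.cat([masks, last_mask], dim=2)
    return frames, masks
```

Repeating the last frame (rather than zero-padding) keeps the padded tail visually consistent with the chunk, so the VAE sees plausible content in the extra latent frames.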

Changes

  • src/scope/core/pipelines/wan2_1/vace/blocks/vace_encoding.py: Add temporal underflow guard before VAE encoding in _encode_with_conditioning

Testing

  • Existing tests should still pass (no behaviour change when num_frames >= min_frames)
  • To reproduce locally: pass a vace_input_frames tensor with shape [1, 3, 8, H, W] to streamdiffusionv2 with VACE conditioning enabled. This crashed before the fix; it now pads and logs a WARNING.
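The underlying PyTorch failure can also be reproduced in isolation with a bare Conv3d, without loading the pipeline (the channel count and spatial size here are arbitrary stand-ins, not the VAE's real dimensions):

```python
import torch
import torch.nn as nn


def reproduce_underflow(t: int = 2):
    """Return the RuntimeError raised by a 3x1x1 temporal conv when the
    temporal dim t is below the kernel size, or None if it succeeds."""
    conv = nn.Conv3d(4, 4, kernel_size=(3, 1, 1))
    x = torch.randn(1, 4, t, 8, 8)  # temporal dim t; kernel needs >= 3
    try:
        conv(x)
        return None
    except RuntimeError as err:
        return err


print(reproduce_underflow(2))  # the "Kernel size can't be greater..." error
```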

livepeer-robot added 4 commits March 11, 2026 18:37
Float8DynamicActivationFloat8WeightConfig is not compatible with
torch.compile(fullgraph=False). During warmup on H100 (where compile=True),
AOT autograd's gen_alias_from_base calls aten.as_strided on Float8Tensor
outputs, which is not implemented in torchao:

  NotImplementedError: Float8Tensor dispatch: attempting to run unimplemented
  operator/function: func=<OpOverload(op='aten.as_strided', overload='default')>

The crash manifests specifically after longlive (also FP8) because
torch._dynamo's compile cache is never reset between pipeline switches,
allowing longlive's Float8 dispatch state to persist and influence Krea's
subsequent compile attempt.

Two fixes:

1. krea_realtime_video/pipeline.py: when FP8 quantization is active, skip
   block.compile() — the two optimizations are currently mutually exclusive
   with fullgraph=False. FP8 alone still provides meaningful memory/compute
   savings on H100 without compile.

2. pipeline_manager.py: call torch._dynamo.reset() on every pipeline unload
   to clear stale compiled graphs and Float8 dispatch state, preventing
   cross-pipeline cache pollution.

Fixes #669

Signed-off-by: livepeer-robot <robot@livepeer.org>
…ale-cache recompile

If torch._dynamo.reset() raises during pipeline unload, stale Dynamo/FP8
compile caches remain active in the worker process. Previously the code
swallowed the exception and published pipeline_unloaded unconditionally,
leaving the next krea-realtime-video load free to torch.compile against
those stale caches — re-entering the warmup crash from the FP8→Krea
conflict.

Fix: set self._dynamo_reset_failed = True on reset failure. The Krea load
path now checks this flag and forces compile=False for the lifetime of the
worker, with a clear log warning to restart the process to re-enable
compilation.

Addresses CodeRabbit review comment on PR #670.

Signed-off-by: livepeer-robot <robot@livepeer.org>
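The flag-based fallback described above can be sketched as follows (class and method names are hypothetical, not the repo's actual identifiers; `reset_fn` stands in for `torch._dynamo.reset`):

```python
class PipelineWorkerState:
    """Tracks whether Dynamo caches are known-good for this worker (sketch)."""

    def __init__(self) -> None:
        self._dynamo_reset_failed = False

    def on_unload(self, reset_fn) -> None:
        try:
            reset_fn()
        except Exception:
            # Stale compile caches may survive in this process: disable
            # compilation for the worker's lifetime instead of risking the
            # FP8-vs-compile warmup crash on the next load.
            self._dynamo_reset_failed = True

    def should_compile(self, requested: bool) -> bool:
        # The Krea load path consults this before calling block.compile()
        return requested and not self._dynamo_reset_failed
```

The worker-lifetime scope is deliberate: once a reset fails there is no reliable way to verify the cache state, so only a process restart re-enables compilation.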
…ompile=False

When compile=False, kv_cache_attention_bias was still being set to
DEFAULT_KV_CACHE_ATTENTION_BIAS (0.3), which causes the warmup loop to enter
the flex_attention code path and trigger torch._dynamo tracing even though no
block.compile() call was ever made. This meant the _dynamo_reset_failed guard
in pipeline_manager.py had no effect on the warmup-induced recompilation.

Fix:
- Import KV_CACHE_ATTENTION_BIAS_DISABLED (1.0) from causal_model and use it
  as the initial kv_cache_attention_bias when compile=False. This sentinel
  makes causal_model.py take the standard attention branch and skip the
  flex_attention/torch.compile path entirely.
- Guard the warmup loop behind 'if compile:' — warmup exists solely to prime
  the compiled flex_attention kernel, so it is a no-op (and harmful) when
  compilation is disabled. Log a message when skipped for observability.

Addresses CodeRabbit review comment on PR #671.

Signed-off-by: livepeer-robot <robot@livepeer.org>
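The sentinel selection can be sketched as a small helper (the constant names and values come from the commit message above; the function itself is illustrative):

```python
# Sentinel from causal_model: makes the standard attention branch run,
# skipping the flex_attention/torch.compile path entirely.
KV_CACHE_ATTENTION_BIAS_DISABLED = 1.0
DEFAULT_KV_CACHE_ATTENTION_BIAS = 0.3


def initial_kv_cache_attention_bias(compile_enabled: bool) -> float:
    """Pick the starting bias so that compile=False never triggers
    torch._dynamo tracing via the flex_attention code path."""
    if compile_enabled:
        return DEFAULT_KV_CACHE_ATTENTION_BIAS
    return KV_CACHE_ATTENTION_BIAS_DISABLED
```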
The comment at line 230 already specifies ceil(local_attn_size / num_frame_per_block) + 1,
but the implementation was using floor division (//). When local_attn_size is not
evenly divisible by num_frame_per_block, this meant warmup stopped one iteration early,
leaving the cache short of the steady-state shape and triggering a recompile on the
first live request.

Replace with the ceiling equivalent: (a + b - 1) // b to avoid importing math.

Fixes coderabbitai suggestion on PR #671.

Signed-off-by: livepeer-robot <robot@livepeer.org>
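The floor-to-ceiling replacement above is the standard integer trick:

```python
def ceil_div(a: int, b: int) -> int:
    # ceiling of a/b using only integer arithmetic (no math.ceil import)
    return (a + b - 1) // b


# warmup iterations: ceil(local_attn_size / num_frame_per_block) + 1
# e.g. local_attn_size=13, num_frame_per_block=4 -> ceil_div(13, 4) + 1 = 5
```

When `a` divides evenly, `a + b - 1` stays below the next multiple of `b`, so the result matches floor division; otherwise it lands in the next multiple, giving floor + 1.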
@coderabbitai

coderabbitai bot commented Mar 13, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


livepeer-robot added 2 commits March 13, 2026 06:21
… underflow

The WAN VAE encoder contains a 3x1x1 temporal convolution kernel.  When
the input chunk has fewer than (num_frame_per_block × vae_temporal_downsample_factor)
frames, the latent temporal dimension after downsampling falls below 3, causing:

  RuntimeError: Calculated padded input size per channel: (2 x 64 x 64).
  Kernel size: (3 x 1 x 1). Kernel size can't be greater than actual input size

Observed in prod logs (2026-03-12) for the streamdiffusionv2 pipeline with
VACE conditioning enabled (fal.ai job 10670fc6).

Fix: in _encode_with_conditioning, detect when num_frames < min_frames and
pad to min_frames by repeating the last input frame (and last mask frame
when vace_input_masks is also provided).  A WARNING is emitted so short
chunks remain visible in logs without crashing.

Related: #557 (same block, different axis — spatial width underflow)
Signed-off-by: livepeer-robot <robot@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
@livepeer-tessa livepeer-tessa force-pushed the fix/vace-temporal-kernel-underflow branch from 2c90d83 to b183ecb Compare March 13, 2026 06:21
@github-actions
Contributor

github-actions bot commented Mar 13, 2026

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-674--preview
WebSocket wss://fal.run/daydream/scope-pr-674--preview/ws
Commit b183ecb

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-674--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-674--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.

livepeer-tessa pushed a commit that referenced this pull request Mar 15, 2026
…derflow

The WAN VAE encoder contains a 3×3 spatial convolution kernel.  When
the input chunk has spatial dimensions < 3 on either axis the forward
pass raises:

  RuntimeError: Calculated padded input size per channel: (2 x 513).
  Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Observed in prod logs (2026-03-15, 10:48–10:59 UTC) on krea-realtime-video
pipeline, fal.ai job 5193400c-da0f-4eef-8bdd-dd0fdd26c1db: 2,372 errors
over 11 minutes (~4 errors/second) from an input with height=2 pixels.

Fix: in _encode_with_conditioning, detect when height or width < 3 and
pad to the minimum safe size using F.pad.  The corresponding masks tensor
is also padded to keep shapes consistent.  block_state.height/width are
updated so the downstream resolution check still passes.  A WARNING is
emitted so the unusual input remains visible in logs without a crash.

This is the spatial analogue of the 3×1×1 temporal kernel guard (issue #673,
PR #674).

Fixes #557
Signed-off-by: livepeer-robot <robot@livepeer.org>
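The spatial guard described above can be sketched with `F.pad` as follows (the helper name is hypothetical; replicate padding is assumed here as the spatial analogue of repeating the last frame, and the `block_state.height/width` bookkeeping is omitted):

```python
import torch
import torch.nn.functional as F

MIN_SPATIAL = 3  # lower bound imposed by the 3x3 spatial conv kernel


def pad_spatial(frames, masks=None):
    """Pad [B, C, T, H, W] tensors so H and W are each >= MIN_SPATIAL (sketch)."""
    h, w = frames.shape[-2:]
    pad_h = max(0, MIN_SPATIAL - h)
    pad_w = max(0, MIN_SPATIAL - w)
    if pad_h or pad_w:
        # F.pad on a 5D tensor takes (w_left, w_right, h_top, h_bottom,
        # t_front, t_back); the temporal axis is left untouched here.
        pad = (0, pad_w, 0, pad_h, 0, 0)
        frames = F.pad(frames, pad, mode="replicate")
        if masks is not None:
            # pad the masks too so downstream shapes stay consistent
            masks = F.pad(masks, pad, mode="replicate")
    return frames, masks
```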


Development

Successfully merging this pull request may close these issues.

VaceEncodingBlock temporal kernel underflow in streamdiffusionv2: (2 x 64 x 64) input < (3 x 1 x 1) kernel

1 participant