fix: auto-retry cloud connection on transient failures (#704) #706
livepeer-tessa wants to merge 2 commits into main
…derflow

The WAN VAE encoder contains a 3×3 spatial convolution kernel. When the input chunk has spatial dimensions < 3 on either axis, the forward pass raises:

RuntimeError: Calculated padded input size per channel: (2 x 513). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Observed in prod logs (2026-03-15, 10:48–10:59 UTC) on the krea-realtime-video pipeline, fal.ai job 5193400c-da0f-4eef-8bdd-dd0fdd26c1db: 2 372 errors over 11 minutes (~4 errors/second) from an input with height=2 pixels.

Fix: in _encode_with_conditioning, detect when height or width < 3 and pad to the minimum safe size using F.pad. The corresponding masks tensor is also padded to keep shapes consistent. block_state.height/width are updated so the downstream resolution check still passes. A WARNING is emitted so the unusual input remains visible in logs without a crash.

This is the spatial analogue of the 3×1×1 temporal kernel guard (issue #673, PR #674).

Fixes #557

Signed-off-by: livepeer-robot <robot@livepeer.org>
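The padding arithmetic described in this commit can be sketched in plain Python. This is only an illustration, not the code in _encode_with_conditioning: the helper names are hypothetical, the tensor layout is assumed to end in (height, width), and padding is assumed to be added on the bottom and right edges. The returned tuple follows the (left, right, top, bottom) order that torch.nn.functional.pad expects for the last two dimensions.

```python
from typing import Tuple

# Assumption: the minimum safe size equals the 3x3 kernel size from the commit message.
MIN_SPATIAL_SIZE = 3


def spatial_pad_amounts(
    height: int, width: int, min_size: int = MIN_SPATIAL_SIZE
) -> Tuple[int, int, int, int]:
    """Return (left, right, top, bottom) padding so both axes reach min_size.

    Hypothetical helper: pads only on the right and bottom edges.
    """
    pad_right = max(0, min_size - width)
    pad_bottom = max(0, min_size - height)
    return (0, pad_right, 0, pad_bottom)


def padded_size(
    height: int, width: int, min_size: int = MIN_SPATIAL_SIZE
) -> Tuple[int, int]:
    """Return the (height, width) that would result after applying the padding."""
    left, right, top, bottom = spatial_pad_amounts(height, width, min_size)
    return (height + top + bottom, width + left + right)


# The failing shape from the log: height=2, width=513.
print(spatial_pad_amounts(2, 513))  # (0, 0, 0, 1)
print(padded_size(2, 513))          # (3, 513)
```

With a real tensor, the tuple from spatial_pad_amounts would be passed as the pad argument of F.pad, and the same amounts would be applied to the masks tensor so the shapes stay consistent.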
When remote inference cold-starts, the background connect task can time out waiting for the 'ready' signal even though the cloud runner is already starting up. This left users with a failed connection and no automatic recovery; they had to retry manually.

Add retry logic to connect_background (up to 3 attempts, 5 s delay):
- On each failure, check if the error is transient (timeout, network, connection refused, reset). If so, wait and retry.
- Non-transient errors (auth, config, bad app_id) bail immediately.
- The connect_stage field is updated during the retry delay so the UI can show "Retrying connection (attempt N/3)..." instead of going silent.

Fixes #704: users no longer need to manually retry when the cloud runner cold-starts and the first connection attempt times out.

Signed-off-by: livepeer-robot <robot@livepeer.org>
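A minimal sketch of the retry behaviour described in this commit, for readers who want to see the control flow. It is not the code in cloud_connection.py: connect_with_retry, is_transient, the marker list, and the injected connect, set_stage and sleep callables are all hypothetical, and the sketch is synchronous for simplicity. It also assumes that N in the stage message is the number of the upcoming attempt.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3     # from the commit message
RETRY_DELAY_S = 5.0  # from the commit message

# Assumption: substring markers used to classify an error as transient.
TRANSIENT_MARKERS = ("timeout", "timed out", "network", "connection refused", "reset")


def is_transient(error: Exception) -> bool:
    """Treat timeouts and connection-level errors as retryable."""
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    message = str(error).lower()
    return any(marker in message for marker in TRANSIENT_MARKERS)


def connect_with_retry(
    connect: Callable[[], None],
    set_stage: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = MAX_ATTEMPTS,
    delay_s: float = RETRY_DELAY_S,
) -> int:
    """Call connect() up to max_attempts times; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except Exception as error:
            # Bail on the last attempt or on non-transient errors (auth, config, ...).
            if attempt == max_attempts or not is_transient(error):
                raise
            # Keep the UI informed while waiting for the next attempt.
            set_stage(f"Retrying connection (attempt {attempt + 1}/{max_attempts})...")
            sleep(delay_s)
    raise RuntimeError("max_attempts must be at least 1")
```

Injecting sleep keeps the loop testable without real delays; an async variant would await asyncio.sleep in the same place.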
🚀 fal.ai Preview Deployment
Testing
Connect to this preview deployment by running this on your branch:
🧪 E2E tests will run automatically against this deployment.
✅ E2E Tests passed
Test Artifacts
Check the workflow run for screenshots.
Summary
Fixes #704 — users were stuck with a failed remote inference connection and had to manually retry.
Root cause: `connect_background` made a single attempt to connect. When the fal cloud runner is cold-starting, the backend waits up to 180 s for a `ready` signal. If the runner takes longer (or a transient network hiccup occurs), the connection times out, but the runner is already starting from that first attempt, so any immediate manual retry succeeds. The user was paying for the cold-start latency without getting the benefit.

Fix: Add retry logic to `connect_background` (up to 3 attempts, 5 s delay between tries):
- `connect_stage` is updated to `"Retrying connection (attempt N/3)..."` during the delay so the UI stays informative rather than going silent.

Changes
- `src/scope/server/cloud_connection.py`: `connect_background` retry loop

Testing