
fix: auto-retry cloud connection on transient failures (#704)#706

Open
livepeer-tessa wants to merge 2 commits into main from fix/cloud-connection-auto-retry

Conversation

@livepeer-tessa
Copy link
Contributor

Summary

Fixes #704 — users were stuck with a failed remote inference connection and had to manually retry.

Root cause: connect_background made a single attempt to connect. When the fal cloud runner is cold-starting, the backend waits up to 180 s for a ready signal. If the runner takes longer (or a transient network hiccup occurs), the connection times out — but the runner is already starting from that first attempt, so any immediate manual retry succeeds. The user was paying for the cold-start latency without getting the benefit.

Fix: Add retry logic to connect_background (up to 3 attempts, 5 s delay between tries):

  • On failure, inspect the error string. If it looks transient (timeout, connection, network, refused, reset) → wait 5 s and retry.
  • Non-transient errors (bad app_id, auth failures) bail immediately — no point retrying those.
  • connect_stage is updated to "Retrying connection (attempt N/3)..." during the delay so the UI stays informative rather than going silent.

Changes

  • src/scope/server/cloud_connection.py: add retry loop to connect_background

Testing

  1. Enable remote inference while the cloud runner is cold (first attempt of the day).
  2. Observe: if the first attempt times out, the UI shows the retry stage text and connects automatically on the next attempt.
  3. Verify non-transient errors (e.g. wrong app_id) still surface immediately without retrying.

livepeer-robot added 2 commits March 15, 2026 18:19
…derflow

The WAN VAE encoder contains a 3×3 spatial convolution kernel.  When
the input chunk has spatial dimensions < 3 on either axis the forward
pass raises:

  RuntimeError: Calculated padded input size per channel: (2 x 513).
  Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Observed in prod logs (2026-03-15, 10:48–10:59 UTC) on krea-realtime-video
pipeline, fal.ai job 5193400c-da0f-4eef-8bdd-dd0fdd26c1db: 2 372 errors
over 11 minutes (~4 errors/second) from an input with height=2 pixels.

Fix: in _encode_with_conditioning, detect when height or width < 3 and
pad to the minimum safe size using F.pad.  The corresponding masks tensor
is also padded to keep shapes consistent.  block_state.height/width are
updated so the downstream resolution check still passes.  A WARNING is
emitted so the unusual input remains visible in logs without a crash.

This is the spatial analogue of the 3×1×1 temporal kernel guard (issue #673,
PR #674).

Fixes #557
Signed-off-by: livepeer-robot <robot@livepeer.org>
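The padding guard in this commit message can be sketched as below. This is an assumption-laden illustration: the helper name `pad_to_min_spatial` is invented, and the real `_encode_with_conditioning` also updates `block_state.height/width`, which is omitted here.

```python
import torch
import torch.nn.functional as F

# The WAN VAE encoder's 3x3 spatial conv needs at least 3 px on each axis
MIN_SPATIAL = 3


def pad_to_min_spatial(frames: torch.Tensor, masks: torch.Tensor):
    """Pad (..., H, W) tensors so H and W are both >= MIN_SPATIAL."""
    h, w = frames.shape[-2:]
    pad_h = max(0, MIN_SPATIAL - h)
    pad_w = max(0, MIN_SPATIAL - w)
    if pad_h or pad_w:
        # F.pad takes (left, right, top, bottom) for the last two dims;
        # pad masks identically so downstream shapes stay consistent
        frames = F.pad(frames, (0, pad_w, 0, pad_h))
        masks = F.pad(masks, (0, pad_w, 0, pad_h))
    return frames, masks


# The degenerate input from the prod logs: height=2, width=513
x = torch.zeros(1, 3, 2, 513)
m = torch.ones(1, 1, 2, 513)
x, m = pad_to_min_spatial(x, m)
```

After padding, the 3×3 kernel fits and the forward pass no longer raises; the commit additionally logs a WARNING so these unusual inputs stay visible.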
When remote inference cold-starts, the background connect task can time
out waiting for the 'ready' signal even though the cloud runner is
already starting up. This left users with a failed connection and no
automatic recovery — they had to manually retry.

Add retry logic to connect_background (up to 3 attempts, 5 s delay):
- On each failure, check if the error is transient (timeout, network,
  connection refused, reset). If so, wait and retry.
- Non-transient errors (auth, config, bad app_id) bail immediately.
- The connect_stage field is updated during the retry delay so the UI
  can show "Retrying connection (attempt N/3)..." instead of going
  silent.

Fixes #704 — users no longer need to manually retry when the cloud
runner cold-starts and the first connection attempt times out.

Signed-off-by: livepeer-robot <robot@livepeer.org>
@livepeer-tessa livepeer-tessa self-assigned this Mar 17, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8b459b4c-0c4a-454c-aa83-579d3f59134e


@github-actions
Copy link
Contributor

🚀 fal.ai Preview Deployment

App ID daydream/scope-pr-706--preview
WebSocket wss://fal.run/daydream/scope-pr-706--preview/ws
Commit 3f3605a

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-706--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Copy link
Contributor

✅ E2E Tests passed

Status passed
fal App daydream/scope-pr-706--preview
Run View logs

Test Artifacts

Check the workflow run for screenshots.



Development

Successfully merging this pull request may close these issues.

Cannot connect to Scope remote inference
