fix: auto-retry cloud connection on transient failures (#704) by livepeer-tessa · Pull Request #706 · daydreamlive/scope

livepeer-tessa · 2026-03-17T16:25:20Z

Summary

Fixes #704 — users were stuck with a failed remote inference connection and had to manually retry.

Root cause: connect_background made a single attempt to connect. When the fal cloud runner is cold-starting, the backend waits up to 180 s for a ready signal. If the runner takes longer (or a transient network hiccup occurs), the connection times out — but the runner is already starting from that first attempt, so any immediate manual retry succeeds. The user was paying for the cold-start latency without getting the benefit.

Fix: Add retry logic to connect_background (up to 3 attempts, 5 s delay between tries):

- On failure, inspect the error string. If it looks transient (timeout, connection, network, refused, reset) → wait 5 s and retry.
- Non-transient errors (bad app_id, auth failures) bail immediately: there is no point retrying those.
- connect_stage is updated to "Retrying connection (attempt N/3)..." during the delay so the UI stays informative rather than going silent.
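
The retry behaviour described above could be sketched roughly as follows. This is not the code from cloud_connection.py: `connect_with_retry`, `is_transient`, the `connect`/`set_stage`/`sleep` callables, and the choice to report the upcoming attempt number in the stage text are all assumptions made for illustration. Only the attempt count, the delay, the marker words, and the stage wording come from the description.

```python
import time
from typing import Callable

# Substrings treated as transient, taken from the PR description.
TRANSIENT_MARKERS = ("timeout", "connection", "network", "refused", "reset")

MAX_ATTEMPTS = 3      # "up to 3 attempts"
RETRY_DELAY_S = 5.0   # "5 s delay between tries"


def is_transient(error: Exception) -> bool:
    """Return True if the error message contains a transient-looking marker."""
    message = str(error).lower()
    return any(marker in message for marker in TRANSIENT_MARKERS)


def connect_with_retry(
    connect: Callable[[], None],        # hypothetical: performs one connection attempt
    set_stage: Callable[[str], None],   # hypothetical: updates the UI-facing stage text
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() up to MAX_ATTEMPTS times and return the attempt that succeeded.

    Non-transient errors, and the error from the final attempt, are re-raised.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            connect()
            return attempt
        except Exception as error:
            if attempt == MAX_ATTEMPTS or not is_transient(error):
                raise
            # Keep the UI informed while waiting for the next attempt.
            # Assumption: N in the stage text is the attempt about to start.
            set_stage(f"Retrying connection (attempt {attempt + 1}/{MAX_ATTEMPTS})...")
            sleep(RETRY_DELAY_S)
    raise AssertionError("unreachable")
```

One caveat with string matching: an exception whose message reads "timed out" does not contain "timeout", so a real classifier would probably also need to check exception types.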

Changes

src/scope/server/cloud_connection.py — connect_background retry loop

Testing

Enable remote inference while the cloud runner is cold (first attempt of the day).
Observe: if the first attempt times out, the UI shows the retry stage text and connects automatically on the next attempt.
Verify non-transient errors (e.g. wrong app_id) still surface immediately without retrying.

…derflow

The WAN VAE encoder contains a 3×3 spatial convolution kernel. When the input chunk has spatial dimensions < 3 on either axis the forward pass raises:

RuntimeError: Calculated padded input size per channel: (2 x 513). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

Observed in prod logs (2026-03-15, 10:48–10:59 UTC) on krea-realtime-video pipeline, fal.ai job 5193400c-da0f-4eef-8bdd-dd0fdd26c1db: 2 372 errors over 11 minutes (~4 errors/second) from an input with height=2 pixels.

Fix: in _encode_with_conditioning, detect when height or width < 3 and pad to the minimum safe size using F.pad. The corresponding masks tensor is also padded to keep shapes consistent. block_state.height/width are updated so the downstream resolution check still passes. A WARNING is emitted so the unusual input remains visible in logs without a crash.

This is the spatial analogue of the 3×1×1 temporal kernel guard (issue #673, PR #674).

Fixes #557

Signed-off-by: livepeer-robot <robot@livepeer.org>
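
To make the padding arithmetic in the commit above concrete, here is a small sketch of how the pad amounts might be computed. It is not the code from _encode_with_conditioning: `min_size_padding` and `padded_size` are hypothetical helpers, and padding only on the right and bottom edges is an assumption, since the commit message does not say which sides are padded. The returned tuple follows the `(left, right, top, bottom)` order that `torch.nn.functional.pad` uses for the last two dimensions of a tensor.

```python
from typing import Tuple

MIN_SPATIAL_SIZE = 3  # a 3x3 kernel needs at least 3 pixels on each axis


def min_size_padding(
    height: int, width: int, min_size: int = MIN_SPATIAL_SIZE
) -> Tuple[int, int, int, int]:
    """Return (left, right, top, bottom) padding that lifts both axes to min_size.

    Assumption: extra pixels are added on the right and bottom edges only.
    """
    pad_h = max(0, min_size - height)
    pad_w = max(0, min_size - width)
    return (0, pad_w, 0, pad_h)


def padded_size(
    height: int, width: int, min_size: int = MIN_SPATIAL_SIZE
) -> Tuple[int, int]:
    """Return the (height, width) that results from applying min_size_padding."""
    return (max(height, min_size), max(width, min_size))
```

With PyTorch available, the tuple could be passed as `F.pad(x, min_size_padding(h, w))` for a tensor whose last two dimensions are height and width. The same tuple would apply to the masks tensor, and `padded_size` gives the values that block_state.height/width would need to take afterwards.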

When remote inference cold-starts, the background connect task can time out waiting for the 'ready' signal even though the cloud runner is already starting up. This left users with a failed connection and no automatic recovery: they had to manually retry.

Add retry logic to connect_background (up to 3 attempts, 5 s delay):

- On each failure, check if the error is transient (timeout, network, connection refused, reset). If so, wait and retry.
- Non-transient errors (auth, config, bad app_id) bail immediately.
- The connect_stage field is updated during the retry delay so the UI can show "Retrying connection (attempt N/3)..." instead of going silent.

Fixes #704: users no longer need to manually retry when the cloud runner cold-starts and the first connection attempt times out.

Signed-off-by: livepeer-robot <robot@livepeer.org>

coderabbitai · 2026-03-17T16:25:32Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


github-actions · 2026-03-17T16:29:28Z

🚀 fal.ai Preview Deployment


App ID	`daydream/scope-pr-706--preview`
WebSocket	`wss://fal.run/daydream/scope-pr-706--preview/ws`
Commit	`3f3605a`

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-706--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

github-actions · 2026-03-17T16:34:26Z

✅ E2E Tests passed


Status	passed
fal App	`daydream/scope-pr-706--preview`
Run	View logs

Test Artifacts

Check the workflow run for screenshots.

livepeer-robot added 2 commits March 15, 2026 18:19

livepeer-tessa self-assigned this Mar 17, 2026

livepeer-tessa wants to merge 2 commits into main from fix/cloud-connection-auto-retry
