fix(logging): recover gracefully when log file deleted mid-session (#711) #712

Open
livepeer-tessa wants to merge 2 commits into main from fix/711-resilient-rotating-file-handler

@livepeer-tessa
Contributor

Summary

Fixes #711.

On fal.ai workers the OS may clean up /tmp while the process is still running. When RotatingFileHandler's stream is closed (e.g. during doRollover()) and the log directory has since been removed, every subsequent log call emits a noisy --- Logging error --- traceback to stderr instead of writing to the log file.

Root cause

The failure path is:

  1. RotatingFileHandler accumulates 5 MB → calls doRollover()
  2. doRollover() closes the stream and sets self.stream = None
  3. Meanwhile the OS deletes /tmp/.daydream-scope/…/logs/
  4. doRollover() (or the next shouldRollover() call) tries self._open(), which raises FileNotFoundError
  5. Python's default handleError dumps a traceback to stderr for every subsequent log record

Fix

Introduce ResilientRotatingFileHandler (in logs_config.py) — a drop-in subclass of RotatingFileHandler that:

  • Overrides shouldRollover() — catches FileNotFoundError, recreates the log directory + file via _reopen_stream(), retries the rollover check.
  • Overrides emit() — catches FileNotFoundError, recreates the log directory + file, retries the write.
  • Falls back to the standard handleError() path only when recovery itself fails (truly unrecoverable errors).

_configure_logging() in app.py now uses ResilientRotatingFileHandler instead of the stdlib class.
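
For readers who want the shape of the handler without opening the diff, here is a minimal sketch of the approach described above. It is an illustration under assumptions, not the code in logs_config.py: only the class name, _reopen_stream(), and the set of overridden methods come from this description, while the method bodies and demo() are hypothetical.

```python
import logging
import os
import shutil
import tempfile
from logging.handlers import RotatingFileHandler


class ResilientRotatingFileHandler(RotatingFileHandler):
    """Sketch: recreate the log directory and file when they vanish."""

    def _reopen_stream(self):
        # Drop any stale stream, then recreate the directory and the file.
        if self.stream is not None:
            try:
                self.stream.close()
            except OSError:
                pass
            self.stream = None
        os.makedirs(os.path.dirname(self.baseFilename), exist_ok=True)
        self.stream = self._open()

    def shouldRollover(self, record):
        try:
            return super().shouldRollover(record)
        except FileNotFoundError:
            self._reopen_stream()
            return super().shouldRollover(record)

    def emit(self, record):
        try:
            try:
                if self.shouldRollover(record):
                    self.doRollover()
                logging.FileHandler.emit(self, record)
            except FileNotFoundError:
                # One recovery attempt, then retry the write.
                self._reopen_stream()
                logging.FileHandler.emit(self, record)
        except Exception:
            # Recovery itself failed: fall back to the standard path.
            self.handleError(record)


def demo():
    """Hypothetical usage: log, delete the directory, log again."""
    log_dir = os.path.join(tempfile.mkdtemp(), "logs")
    os.makedirs(log_dir)
    path = os.path.join(log_dir, "app.log")
    handler = ResilientRotatingFileHandler(path, maxBytes=1024, backupCount=1)
    logger = logging.getLogger("resilient-demo")
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    logger.info("before deletion")
    handler.stream.close()  # what doRollover() does first
    handler.stream = None
    shutil.rmtree(log_dir)  # what the OS cleanup does

    logger.info("after deletion")  # should recreate logs/app.log
    logger.removeHandler(handler)
    handler.close()
    with open(path) as f:
        return f.read()
```

The sketch reimplements emit() instead of wrapping super().emit() because the stdlib BaseRotatingHandler.emit() already catches every exception and routes it to handleError(), so a FileNotFoundError never reaches a subclass that only wraps the parent call.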

Tests

Added TestResilientRotatingFileHandler in tests/test_logs_config.py:

  • Normal emit works
  • Recovery after directory deletion (with stream closed)
  • Recovery after file deletion (with stream closed, directory intact)
  • shouldRollover() recovery after directory deletion
  • Fallback to handleError when recovery itself fails

All 22 tests pass.

livepeer-robot added 2 commits March 17, 2026 16:31
Without a heartbeat, aiohttp does not send WebSocket ping frames, so
NAT gateways, proxies, and firewalls can silently drop idle TCP
connections. This manifests as code=1006 (abnormal closure / no close
frame) after 10-30 minutes of use.

Set heartbeat=30.0 on ws_connect so aiohttp sends a ping frame every
30 seconds, keeping the connection alive through middleboxes.

Fixes #707

Signed-off-by: livepeer-robot <robot@livepeer.org>

On fal.ai workers, the OS may clean up /tmp while the process is still
running (issue #711). When RotatingFileHandler's stream is closed (e.g.
during doRollover()) and the log directory has been removed, subsequent
log calls emit noisy '--- Logging error ---' tracebacks to stderr
instead of writing to a log file.

Introduce ResilientRotatingFileHandler, a subclass that:
- Overrides shouldRollover() to catch FileNotFoundError, recreate the
  log directory/file via _reopen_stream(), and retry the check.
- Overrides emit() to catch FileNotFoundError, recreate the
  log directory/file via _reopen_stream(), and retry the write.
- Falls back to the standard handleError() path only when recovery
  itself fails (truly unrecoverable errors).

Use ResilientRotatingFileHandler in _configure_logging() in app.py
instead of the stdlib RotatingFileHandler.

Add unit tests covering: normal operation, recovery after directory
deletion, recovery after file deletion (stream previously closed), and
fallback to handleError on unrecoverable errors.

Fixes #711

Signed-off-by: livepeer-robot <robot@livepeer.org>
@coderabbitai

coderabbitai bot commented Mar 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@github-actions
Contributor

🚀 fal.ai Preview Deployment

App ID: daydream/scope-pr-712--preview
WebSocket: wss://fal.run/daydream/scope-pr-712--preview/ws
Commit: ff3bbb6

Testing

Connect to this preview deployment by running this on your branch:

uv run build && SCOPE_CLOUD_APP_ID="daydream/scope-pr-712--preview/ws" uv run daydream-scope

🧪 E2E tests will run automatically against this deployment.

@github-actions
Contributor

✅ E2E Tests passed

Status: passed
fal App: daydream/scope-pr-712--preview
Run: View logs

Test Artifacts

Check the workflow run for screenshots.

Development

Successfully merging this pull request may close these issues.

RotatingFileHandler FileNotFoundError: /tmp log file deleted mid-session on fal.ai workers
