fix(logging): recover gracefully when log file deleted mid-session (#711) #712
livepeer-tessa wants to merge 2 commits into main
Without a heartbeat, aiohttp does not send WebSocket ping frames, so NAT gateways, proxies, and firewalls can silently drop idle TCP connections. This manifests as code=1006 (abnormal closure / no close frame) after 10-30 minutes of use.

Set heartbeat=30.0 on ws_connect so aiohttp sends a ping frame every 30 seconds, keeping the connection alive through middleboxes.

Fixes #707

Signed-off-by: livepeer-robot <robot@livepeer.org>
On fal.ai workers, the OS may clean up /tmp while the process is still running (issue #711). When RotatingFileHandler's stream is closed (e.g. during doRollover()) and the log directory has been removed, subsequent log calls emit noisy '--- Logging error ---' tracebacks to stderr instead of writing to a log file.

Introduce ResilientRotatingFileHandler, a subclass that:
- Overrides shouldRollover() to catch FileNotFoundError, recreate the log directory/file via _reopen_stream(), and retry the check.
- Overrides emit() to catch FileNotFoundError, recreate the log directory/file via _reopen_stream(), and retry the write.
- Falls back to the standard handleError() path only when recovery itself fails (truly unrecoverable errors).

Use ResilientRotatingFileHandler in _configure_logging() in app.py instead of the stdlib RotatingFileHandler.

Add unit tests covering: normal operation, recovery after directory deletion, recovery after file deletion (stream previously closed), and fallback to handleError on unrecoverable errors.

Fixes #711

Signed-off-by: livepeer-robot <robot@livepeer.org>
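The failure this commit addresses can be reproduced with the stdlib handler alone. The sketch below is illustrative and not taken from the PR: `CountingHandler` and `reproduce_failure` are hypothetical names, the stream is closed by hand to mimic the state `doRollover()` leaves behind, and `shutil.rmtree` stands in for the OS cleaning up /tmp.

```python
import logging
import os
import shutil
import tempfile
from logging.handlers import RotatingFileHandler


class CountingHandler(RotatingFileHandler):
    """Stdlib handler that counts handleError() calls instead of printing tracebacks."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.errors = 0

    def handleError(self, record):
        self.errors += 1


def reproduce_failure():
    """Return (handleError call count, whether the log directory exists afterwards)."""
    tmp = tempfile.mkdtemp()
    log_dir = os.path.join(tmp, "logs")
    os.makedirs(log_dir)
    handler = CountingHandler(
        os.path.join(log_dir, "app.log"), maxBytes=5 * 1024 * 1024, backupCount=1
    )
    logger = logging.getLogger("stdlib-repro")  # hypothetical logger name
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    try:
        logger.info("written normally")
        # Assumption: mimic the post-rollover state (stream closed and set to None).
        handler.stream.close()
        handler.stream = None
        shutil.rmtree(log_dir)  # stand-in for the OS removing the log directory
        logger.info("lost record 1")
        logger.info("lost record 2")
        return handler.errors, os.path.exists(log_dir)
    finally:
        logger.removeHandler(handler)
        handler.close()
        shutil.rmtree(tmp, ignore_errors=True)
```

Each record logged after the deletion makes the handler try to reopen the file, fail, and call `handleError()`; the stdlib class never recreates the directory, so the error repeats for every record.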
✅ E2E Tests passed
Summary
Fixes #711.
On fal.ai workers the OS may clean up /tmp while the process is still running. When RotatingFileHandler's stream is closed (e.g. during doRollover()) and the log directory has since been removed, every subsequent log call emits a noisy --- Logging error --- traceback to stderr instead of writing to the log file.

Root cause
The failure path is:
1. RotatingFileHandler accumulates 5 MB → calls doRollover()
2. doRollover() closes the stream and sets self.stream = None
3. The OS removes /tmp/.daydream-scope/…/logs/
4. doRollover() (or the next shouldRollover() call) tries self._open() → FileNotFoundError
5. handleError dumps a traceback to stderr for every subsequent log record

Fix
Introduce ResilientRotatingFileHandler (in logs_config.py), a drop-in subclass of RotatingFileHandler that:
- Overrides shouldRollover(): catches FileNotFoundError, recreates the log directory + file via _reopen_stream(), retries the rollover check.
- Overrides emit(): catches FileNotFoundError, recreates the log directory + file, retries the write.
- Falls back to the standard handleError() path only when recovery itself fails (truly unrecoverable errors).

_configure_logging() in app.py now uses ResilientRotatingFileHandler instead of the stdlib class.

Tests
Added TestResilientRotatingFileHandler in tests/test_logs_config.py:
- shouldRollover() recovery after directory deletion
- Fallback to handleError when recovery itself fails

All 22 tests pass.
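For readers without the diff open, the handler described under Fix might look roughly like the sketch below. The class name and `_reopen_stream()` come from this PR's description, but the method bodies are an assumption reconstructed from that description, not the code in logs_config.py.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


class ResilientRotatingFileHandler(RotatingFileHandler):
    """Sketch: reopen the log file if its directory vanished mid-session."""

    def _reopen_stream(self):
        # Drop any stale stream, recreate the directory, then open a fresh file.
        if self.stream is not None:
            try:
                self.stream.close()
            except OSError:
                pass
            self.stream = None
        os.makedirs(os.path.dirname(self.baseFilename), exist_ok=True)
        self.stream = self._open()

    def shouldRollover(self, record):
        try:
            return super().shouldRollover(record)
        except FileNotFoundError:
            # Log directory or file is gone: recreate it and retry the check once.
            self._reopen_stream()
            return super().shouldRollover(record)

    def emit(self, record):
        try:
            if self.shouldRollover(record):
                self.doRollover()
            logging.FileHandler.emit(self, record)
        except FileNotFoundError:
            try:
                # Recreate the directory/file and retry the write once.
                self._reopen_stream()
                logging.FileHandler.emit(self, record)
            except Exception:
                # Recovery itself failed: fall back to the standard error path.
                self.handleError(record)
        except Exception:
            self.handleError(record)
```

Under these assumptions the class takes the same constructor arguments as RotatingFileHandler, so switching `_configure_logging()` over amounts to changing the class name. Each failed operation is retried only once, which keeps a persistently broken filesystem from looping and routes it to `handleError()` instead.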