merging docker health monitor into main #22

Merged
haoruizhou merged 14 commits into main from docker-health-v2
Mar 20, 2026

@shark1Martin
Contributor

No description provided.

@shark1Martin requested a review from haoruizhou on March 20, 2026 18:04
…on metric

- Drop _best_effort_write_health_to_influx from /api/health-status — health-monitor
  sidecar is the authoritative writer; frontend calls were creating duplicate points
- Add depends_on: data-downloader-api to health-monitor in docker-compose so it
  doesn't log API errors on cold start
- Replace meaningless runs_per_minute (run count / poll interval) with
  last_scan_duration_seconds, computed from started_at/finished_at already tracked
  in scanner_status.json — measures how long the slicks InfluxDB scan actually took
- Expose last_scan_duration_seconds in /api/health-status response
- Add health-monitor env vars to .env.example

Old pre-slicks entries with non-round-hour timestamps would persist
forever because merge_scanned_runs kept all vanished runs, not just
the ones with user notes. It now preserves only vanished entries that
have a note, so noise artifacts are cleaned out on each fresh scan.
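
The cleanup rule described in that commit message can be sketched roughly as follows. This is a minimal illustration, not the project's actual merge_scanned_runs: the function name merge_scanned_runs_sketch, the dict-of-dicts layout keyed by a run identifier, and the "note" field name are all assumptions made for the example.

```python
from typing import Any, Dict

Run = Dict[str, Any]


def _has_note(run: Run) -> bool:
    """Return True when the entry carries a non-empty user note (assumed field: "note")."""
    return bool(str(run.get("note") or "").strip())


def merge_scanned_runs_sketch(existing: Dict[str, Run], scanned: Dict[str, Run]) -> Dict[str, Run]:
    """Merge a fresh scan into previously stored runs (hypothetical data layout).

    - Runs found by the fresh scan are always kept, carrying over any old note.
    - Runs that vanished from the scan are kept only if they have a user note.
    """
    merged: Dict[str, Run] = {}

    # Everything the fresh scan found is authoritative.
    for key, run in scanned.items():
        entry = dict(run)
        previous = existing.get(key)
        if previous is not None and _has_note(previous):
            entry["note"] = previous["note"]
        merged[key] = entry

    # Vanished entries survive only when a user annotated them.
    for key, previous in existing.items():
        if key not in scanned and _has_note(previous):
            merged[key] = dict(previous)

    return merged
```

Under these assumptions, an un-annotated artifact from the old scanner disappears on the next scan, while a vanished run that a user bothered to annotate is retained.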
@wfr-data-acquisition
Contributor

Summary of changes in this PR

Original work (shark1Martin)

  • Added a health-monitor Docker sidecar that uses the Docker socket to inspect container states and writes metrics to the InfluxDB monitoring database (monitor.container, monitor.service, monitor.ping measurements)
  • Exposed /api/health-status endpoint on the data-downloader API so the DAQ website can poll it
  • Updated allowed CORS origins

Review fixes (wfr-data-acquisition)

  • Removed dual-write from /api/health-status: the endpoint was writing to InfluxDB on every frontend poll (every 15s). The health-monitor sidecar is the authoritative writer at its own interval — no need to write from the API too
  • Added depends_on: data-downloader-api to health-monitor in docker-compose so it is not started before the API container and doesn't log API errors on cold start
  • Replaced runs_per_minute metric with last_scan_duration_seconds — measures how long the slicks scan actually took, which is meaningful for monitoring
  • DAQ website updated to display scan duration: "OK · last scan Xs" on the Telemetry Scanner card
  • Added missing env vars to .env.example for health-monitor config
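
As an illustration of the last_scan_duration_seconds metric mentioned above, here is a minimal sketch of how a duration could be derived from the started_at and finished_at values tracked in scanner_status.json. The helper name and the assumption that both timestamps are ISO 8601 strings are mine; the actual field format in the project may differ.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def last_scan_duration_seconds(status: Dict[str, Any]) -> Optional[float]:
    """Compute the scan duration from a scanner status dict.

    Assumes "started_at" and "finished_at" are ISO 8601 strings. Returns None
    when either is missing, unparsable, or the scan has not finished yet.
    """
    started = status.get("started_at")
    finished = status.get("finished_at")
    if not started or not finished:
        return None
    try:
        start = datetime.fromisoformat(started)
        end = datetime.fromisoformat(finished)
        duration = (end - start).total_seconds()
    except (TypeError, ValueError):
        # Covers malformed strings, non-string values, and mixing
        # timezone-aware with naive timestamps.
        return None
    # A negative value means finished_at belongs to an earlier scan.
    return duration if duration >= 0 else None
```

Returning None rather than 0 for an unfinished or inconsistent scan lets a dashboard distinguish "no data yet" from "the scan was instantaneous".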

Data Downloader improvements (wfr-data-acquisition)

  • Multi-season table fix: each season now uses its own table name (WFR25 → WFR25, WFR26 → WFR26) instead of a single global INFLUX_TABLE. The old code would query the wrong table for past seasons
  • Past season manual scans: UI now shows a season dropdown next to "Trigger Scan" so you can force a scan of any past season (e.g. WFR25)
  • Scheduled scans scope: periodic worker (6AM daily) only scans the active season (highest year). Past seasons are manual-only
  • Fixed PR bug in services.py: removed dead code block that referenced non-existent self.settings.influx_database attribute — would have caused a runtime crash on the second mark_finish call
  • Ghost run cleanup: merge_scanned_runs was keeping ALL vanished entries forever, causing ~150 artifact entries from the old pre-slicks scanner (non-round-hour timestamps, 1-row windows) to persist indefinitely. Now only preserves vanished entries that have a user note
  • Bumped slicks to >=0.2.1
  • Fixed CI typo: workflow had slicks>=2.0.1 instead of slicks>=0.2.1
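
The season handling described in this list (one table per season, scheduled scans limited to the season with the highest year) could look roughly like the sketch below. All function names are hypothetical, and the assumption that a season name ends in its two-digit year (as in WFR25, WFR26) is inferred from the examples given, not from the code.

```python
import re
from typing import List


def season_year(season: str) -> int:
    """Extract the trailing year digits from a season name such as "WFR25"."""
    match = re.search(r"(\d+)$", season)
    if match is None:
        raise ValueError(f"season name has no trailing year: {season!r}")
    return int(match.group(1))


def active_season(seasons: List[str]) -> str:
    """The active season is the one with the highest year."""
    return max(seasons, key=season_year)


def table_for_season(season: str) -> str:
    """Each season maps to a table of the same name (WFR25 -> WFR25)."""
    return season


def seasons_for_scheduled_scan(seasons: List[str]) -> List[str]:
    """The periodic worker scans only the active season; past seasons are manual-only."""
    return [active_season(seasons)]
```

With seasons WFR25 and WFR26 configured, the scheduled scan would target only WFR26, and a manual scan of WFR25 would query the WFR25 table instead of a single global table.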

@haoruizhou
Collaborator

Congrats on your first shipped feature @shark1Martin

@haoruizhou left a comment

good to go

@haoruizhou merged commit e59241d into main on Mar 20, 2026
12 checks passed
@haoruizhou deleted the docker-health-v2 branch on March 20, 2026 19:37

3 participants