Skip to content

fix(services): HealthChecker.Register immediately populates cache when already running#1976

Open
Fletch153 wants to merge 2 commits intomainfrom
fix/health-checker-register-immediate-update
Open

fix(services): HealthChecker.Register immediately populates cache when already running#1976
Fletch153 wants to merge 2 commits intomainfrom
fix/health-checker-register-immediate-update

Conversation

@Fletch153
Copy link
Copy Markdown

Summary

  • Root cause: When Register() is called on an already-started HealthChecker (e.g. by the job spawner after starting a new job service), the new service's health state is absent from IsHealthy() / IsReady() for up to 15 seconds — the full polling interval — even though Ready() already returns nil.
  • Fix: Register() now calls IfStarted after adding the service, immediately populating c.ready and c.healthy for that service.
  • The servicesMu write lock is scoped to an inner closure and released before calling IfStarted to avoid a deadlock with Start(), which holds the StateMachine lock while calling update() (which acquires servicesMu.RLock).

Changes

  • pkg/services/health.goRegister() triggers an immediate single-service health check when the checker is already running
  • pkg/services/health_test.go — two new tests covering healthy and unhealthy services registered on a running checker

Test plan

  • go test ./pkg/services/... -run TestHealthChecker_Register — both new tests pass
  • go test ./pkg/services/... — no regressions

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 10, 2026

⚠️ API Diff Results - github.com/smartcontractkit/chainlink-common

⚠️ Breaking Changes (7)

package github (1)
  • com/smartcontractkit/chainlink-common/script — 🗑️ Removed
pkg/capabilities/v2/actions/confidentialrelay.SecretsRequestParams (1)
  • OrgID — 🗑️ Removed
pkg/capabilities/v2/actions/confidentialworkflow.(*WorkflowExecution) (1)
  • GetOrgId — 🗑️ Removed
pkg/capabilities/v2/actions/confidentialworkflow.WorkflowExecution (1)
  • OrgId — 🗑️ Removed
pkg/workflows (3)
  • GenerateExecutionIDWithTriggerIndex — 🗑️ Removed

  • GetTriggerIndexFromReferenceID — 🗑️ Removed

  • GetTriggerReferenceID — 🗑️ Removed

✅ Compatible Changes (3)

pkg/workflows/dontime/pb (1)
  • Observations — ➕ Added
pkg/workflows/dontime/pb.(*Observation) (1)
  • GetPruneExecutions — ➕ Added
pkg/workflows/dontime/pb.Observation (1)
  • PruneExecutions — ➕ Added

📄 View full apidiff report

Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from f6b9ba9 to e76df7d Compare April 10, 2026 21:43
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from e76df7d to d2ac04e Compare April 10, 2026 21:52
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from d2ac04e to 07a72b4 Compare April 10, 2026 22:26
Fletch153 added a commit to smartcontractkit/chainlink that referenced this pull request Apr 10, 2026
HealthChecker.Register() now signals the polling goroutine to run
update() immediately when the checker is already running, eliminating
the up-to-15-second window where a dynamically registered service
(e.g. a spawned job) was absent from IsHealthy() / IsReady() even
though Ready() already returned nil.

Fixes CORE-2386.

Companion chainlink-common PR: smartcontractkit/chainlink-common#1976
mchain0
mchain0 previously approved these changes Apr 13, 2026
jmank88
jmank88 previously approved these changes Apr 13, 2026
Replace the chRefresh channel mechanism with direct map population in
Register(). Instead of signaling the run() goroutine to call update()
(which re-checks ALL services), Register() now directly populates
c.ready and c.healthy for just the newly registered service. This is
simpler and avoids unnecessary re-polling of all services.
@Fletch153 Fletch153 force-pushed the fix/health-checker-register-immediate-update branch from 111cd0b to 575541a Compare April 13, 2026 13:17
Fix missing servicesMu.Unlock() in Unregister that would cause a
deadlock on every call. Revert unrelated whitespace alignment changes
to struct definition and New() constructor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants