IX-1782: Que health check surfaces unhealthy state if postgres_error on any worker#118

Draft
jcardozagc wants to merge 1 commit into master from ix-1782-healthcheck-checks-workers-for-pg-error
Conversation

@jcardozagc jcardozagc commented Feb 25, 2026

During a recent incident, following a CloudSQL maintenance event, Que worker pods ended up with stale database connections. Workers handle this internally: on a PG::Error, the work loop returns :postgres_error, sleeps for the wake interval, and retries. The health check, however, is a hardcoded lambda that always returns 200, so Kubernetes has no signal to restart the affected pods; in the incident, a worker ended up perpetually retrying a faulty connection every 5 seconds.

Let's track the result of each work cycle on each worker itself and expose it through a healthy? predicate, which simply checks whether the worker hit a postgres_error on its last work_loop cycle. The health check endpoint now delegates to a new WorkerHealthCheck that checks every worker in the group and returns 503 if any of them is in an unhealthy state; otherwise it returns the 200 it has always returned.

Yes, this health check can now return a non-200 response, and that is the only 'breaking' change, but it means Kubernetes gets a signal to restart pods once the liveness probe failure threshold is breached.
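For illustration, here's a minimal sketch of the shape described above. The names (record_result, WorkerGroup, the Rack-style call signature) are assumptions for the sketch, not the actual diff in this PR:

```ruby
# Hypothetical sketch: each worker remembers the result of its last
# work_loop cycle and exposes healthy? based on it.
class Worker
  def record_result(result)
    @last_result = result
  end

  # Unhealthy only when the last cycle ended in :postgres_error.
  def healthy?
    @last_result != :postgres_error
  end
end

# Stand-in for the worker group; the real object just needs #workers.
WorkerGroup = Struct.new(:workers)

class WorkerHealthCheck
  def initialize(worker_group)
    @worker_group = worker_group
  end

  # Rack-style endpoint: 200 when every worker is healthy, 503 otherwise.
  def call(_env)
    if @worker_group.workers.all?(&:healthy?)
      [200, {}, ["OK"]]
    else
      [503, {}, ["Unhealthy"]]
    end
  end
end
```

With this shape, the Kubernetes liveness probe can point at the endpoint and restart the pod after the configured failure threshold.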

@jcardozagc jcardozagc force-pushed the ix-1782-healthcheck-checks-workers-for-pg-error branch 3 times, most recently from 7f5a866 to fcd7731 Compare March 19, 2026 14:55
  end

  def call(_env)
    if @worker_group.workers.all?(&:healthy?)
I'm not sure we want to say the app is unhealthy if just one worker has a problem. That could be transient and might mean we end up restarting the pod for a single error, which probably isn't what we need.

Maybe we do something like

if @worker_group.workers.any?(&:healthy?)

So we end up only triggering the else branch when ALL the workers are unhealthy?
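To make the difference concrete, here's a toy comparison of the two predicates, with hypothetical booleans standing in for each worker's healthy? result in a group where one of three workers is unhealthy:

```ruby
# One unhealthy worker out of three (illustrative values, not real workers).
workers_healthy = [true, true, false]

# PR version: 503 as soon as ANY worker is unhealthy.
all_healthy = workers_healthy.all? { |h| h } # => false, so respond 503

# Suggested version: 503 only when EVERY worker is unhealthy.
any_healthy = workers_healthy.any? { |h| h } # => true, so still respond 200
```

Under the suggested any? form, a single transient PG::Error on one worker no longer fails the liveness probe; the pod is only restarted when the whole group is stuck.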

@jcardozagc jcardozagc force-pushed the ix-1782-healthcheck-checks-workers-for-pg-error branch from fcd7731 to 66be132 Compare March 19, 2026 15:34