IX-1782: Que health check surfaces unhealthy state if postgres_error on any worker#118

Draft
jcardozagc wants to merge 1 commit into master from ix-1782-healthcheck-checks-workers-for-pg-error
Conversation

@jcardozagc jcardozagc commented Feb 25, 2026

During a recent incident, following a CloudSQL maintenance event, Que worker pods ended up with stale database connections. Workers handle this internally: on a PG::Error, the work loop returns :postgres_error, sleeps for the wake interval, and retries. The health check, however, is a hardcoded lambda that always returns 200, so Kubernetes has no signal to restart the affected pods; in the incident, a worker ended up perpetually retrying a faulty connection every 5 seconds.

Let's track the result of each work cycle on each worker itself and expose it through a healthy? predicate, which simply checks whether the worker hit a postgres_error on its last work_loop cycle. The health check endpoint now delegates to a new WorkerHealthCheck that checks every worker in the group and returns 503 if any of them is in an unhealthy state; otherwise it returns the 200 it has always returned.

Yes, this health check can now return a non-200 response, and that is the only 'breaking' change, but it means Kubernetes gets a signal to restart pods once the liveness probe failure threshold is breached.
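For illustration, here's a minimal sketch of the shape described above. The names (record_result, WorkerGroup, the Rack-style call signature) are assumptions for the sketch, not the actual diff in this PR:

```ruby
# Hypothetical sketch: each worker remembers the result of its last
# work_loop cycle and exposes healthy? based on it.
class Worker
  def record_result(result)
    @last_result = result
  end

  # Unhealthy only when the last cycle ended in :postgres_error.
  def healthy?
    @last_result != :postgres_error
  end
end

# Stand-in for the worker group; the real object just needs #workers.
WorkerGroup = Struct.new(:workers)

class WorkerHealthCheck
  def initialize(worker_group)
    @worker_group = worker_group
  end

  # Rack-style endpoint: 200 when every worker is healthy, 503 otherwise.
  def call(_env)
    if @worker_group.workers.all?(&:healthy?)
      [200, {}, ["OK"]]
    else
      [503, {}, ["Unhealthy"]]
    end
  end
end
```

With this shape, the Kubernetes liveness probe can point at the endpoint and restart the pod after the configured failure threshold.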

@jcardozagc jcardozagc force-pushed the ix-1782-healthcheck-checks-workers-for-pg-error branch 3 times, most recently from 7f5a866 to fcd7731 Compare March 19, 2026 14:55
  end

  def call(_env)
    if @worker_group.workers.all?(&:healthy?)
I'm not sure we want to say the app is unhealthy if just one worker has a problem. That could be transient and might mean we end up restarting the pod for a single error, which probably isn't what we need.

Maybe we do something like

if @worker_group.workers.any?(&:healthy?)

So we end up only triggering the else branch when ALL the workers are unhealthy?
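To make the difference concrete, here's a toy comparison of the two predicates, with hypothetical booleans standing in for each worker's healthy? result in a group where one of three workers is unhealthy:

```ruby
# One unhealthy worker out of three (illustrative values, not real workers).
workers_healthy = [true, true, false]

# PR version: 503 as soon as ANY worker is unhealthy.
all_healthy = workers_healthy.all? { |h| h } # => false, so respond 503

# Suggested version: 503 only when EVERY worker is unhealthy.
any_healthy = workers_healthy.any? { |h| h } # => true, so still respond 200
```

Under the suggested any? form, a single transient PG::Error on one worker no longer fails the liveness probe; the pod is only restarted when the whole group is stuck.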

@jcardozagc jcardozagc force-pushed the ix-1782-healthcheck-checks-workers-for-pg-error branch from fcd7731 to 66be132 Compare March 19, 2026 15:34