
Have picchu update the replicaset during deployment rampup #602

Open

ddbenson wants to merge 1 commit into main from ddbenson/ramp-via-replicaset

Conversation

ddbenson (Contributor) commented on Feb 26, 2026

Ramp via ReplicaSet

Summary

Fixes production scale-down bug, stuck ramps, and multi-cluster ramp allocation when using autoscalers (HPA/KEDA/WPA).

Core change: During canarying/releasing, Picchu disables the autoscaler and controls ReplicaSet replicas directly. Released revisions keep their autoscaler.


Fixes

1. Production Bug: Released Revisions Losing HPA

Problem: Picchu set Ramping=true for released revisions, deleted their HPAs, and scaled them to divideReplicas(Scale.Default) (often 1). Result: 48 pods → 4.

Fix: Ramping only for canarying/releasing:

```go
Ramping: i.target().Scale.HasAutoscaler() &&
    (i.status.State.Current == "canarying" || i.status.State.Current == "releasing")
```

2. Scale.Default Defaulting

Problem: Apps with scale: { min: 16, max: 96 } and no default got Scale.Default=1 → ramp stuck at 1–4 pods.

Fix: When Scale.Default==0 and Scale.Min set, default to *Scale.Min.

3. BaseCapacity for Multi-Cluster Ramp

Problem: Old revision at 120 pods (HPA-scaled). CanRampTo needs 12 replicas for 10% traffic. We allocated from Scale.Default (16) → only 4 replicas. Ramp stuck.

Fix: When ramping, use baseCapacity (from other revisions' pod count) instead of Scale.Default for replica allocation. Matches CanRampTo logic.

4. New Revision Under-Allocated (Core Issue)

Problem: New revision at 85% traffic had only 144 pods. We used baseCapacity from other revisions only (163) → allocated 163 max. Total system capacity was 307 (163+144). For 85% we need ~261 pods.

Fix: Use getBaseCapacityForRamp() only (other revisions' pods). Do NOT include our own pods—that caused over-allocation (e.g. tutu: 450+383=833 → new gets 752 → total 820).

4b. Over-Allocation Feedback Loop

Problem: Including our pods in total created a feedback loop: our pods → higher total → more allocation → more pods. ramptest3 grew to 232 for a service that should have 100.

Fix: Cap total at expectedTotalReplicas(Scale.Max, 100). So total = min(base + our_current, Scale.Max).

5. Released Revisions Keep Full HPA Range

Problem: Old revision at 15% traffic kept 163 pods. Preemptive pinning was considered but rejected.

Fix: Released revisions keep full Scale.Min to Scale.Max. HPA downscales based on observed load (lower CPU from less traffic), not preemptive pinning.

6. Jubilee / WPA Ramp Updates

Problem: Jubilee workers use WPA. plan/common.go skipped replica updates for WPA targets entirely, so ramping ReplicaSets stayed at WPA-set count (e.g. 2) → ramp stuck.

Fix: When ramping, update ReplicaSet replicas even for WPA (we delete WPA during ramp).

7. Scale.Min=0 Fallback

Problem: Jubilee workers with scale.min: 0 got getBaseCapacityForRamp()=0 when no other revisions → fell back to Scale.Default=1 → 2 replicas, ramp stuck.

Fix: When Scale.Min=0 and no other revisions, use Scale.Max as baseCapacity.


How It Works

| State | Ramping | Replicas controlled by | Autoscaler |
| --- | --- | --- | --- |
| canarying | true | Picchu (`divideReplicas(baseCapacity)`) | Deleted |
| releasing | true | Picchu (`divideReplicas(baseCapacity)`) | Deleted |
| released | false | HPA/KEDA/WPA (full Scale.Min–Scale.Max) | Present |
| deploying/deployed/pendingrelease | false | HPA (or 1 for deploying) | Present |

Ramping mechanics:

  1. ScaleRevision deletes HPA/KEDA/WPA when Ramping=true
  2. SyncRevision sets ReplicaSet replicas and adds picchu.medium.engineering/ramping: "true"
  3. plan/common.go updates ReplicaSet replicas when annotation present (otherwise skips to avoid fighting HPA)

baseCapacity: Mirrors CanRampTo—uses other revisions' pod count (or Scale.Min if none). Ensures we allocate enough replicas to pass the ramp check.

Multiple ramping revisions: Each has its own ReplicaSet; no conflicts. Only newest ramping revision gets traffic incremented.


Files Changed

| File | Change |
| --- | --- |
| api/v1alpha1/common.go | AnnotationRamping |
| api/v1alpha1/source_defaults.go | Default Scale.Default to Scale.Min when unset |
| api/v1alpha1/source_defaults_test.go | New SetScaleDefaults tests |
| controllers/incarnation.go | Ramping scope; getBaseCapacityForRamp; use baseCapacity when ramping |
| controllers/plan/syncRevision.go | Ramping field; set annotation |
| controllers/plan/scaleRevision.go | Ramping field; delete autoscalers when true |
| plan/common.go | ReplicaSet updates when ramping annotation present |

Tests

| Test | Purpose |
| --- | --- |
| TestGenScalePlanRampingByState | Ramping only for canarying/releasing |
| TestScaleRevisionRampingDeletesAutoscalers | Ramping deletes HPA/KEDA/WPA |
| TestScaleRevisionNotRampingCreatesHPA | Not ramping creates HPA |
| TestSyncRevisionRampingAnnotation | Ramping sets annotation |
| TestSyncRevisionRampingUpdatesReplicas | Ramping allows replica updates |
| TestSyncRevisionRampingUpdatesWPAReplicas | Ramping updates WPA ReplicaSet replicas |
| TestSyncRevisionNotRampingPreservesReplicas | Not ramping preserves replicas |
| TestSetScaleDefaults | Scale.Default defaulting |
| TestGetBaseCapacityForRamp | baseCapacity from other revisions (excludes our pods) |
| TestGenScalePlanReleasedKeepsHPA | Released keeps full Scale.Min/Max; HPA downscales on load |

Rollout

  • Backward compatible: Apps with explicit scale.default unchanged.
  • Behavior change: Apps with min/max but no default use Scale.Min as ramp baseline.
  • Production fix: Released revisions no longer lose HPA during deployments.

ddbenson force-pushed the ddbenson/ramp-via-replicaset branch from 763b006 to 7051bb0 on February 26, 2026 22:10
eddiebarth1 (Contributor) commented:

I had Cursor and Claude look this PR over, and it does look like there are a couple of things we should consider.
Cursor feedback is here:
pr_602_ramp_concerns_821839f8.plan.md

Claude agreed with those concerns, along with a couple others:

Part 1: Accuracy of Colleague's Feedback

Concern 1 — Missing Fix 4b (Over-Allocation Cap): CORRECT, HIGH priority

Your colleague is right. The PR description promises a cap at min(base + our_current, Scale.Max), but `getBaseCapacityForRamp()` at controllers/incarnation.go:296-341 has no upper bound. The normalized capacity calculation at line 338 (`normalizedCapacity := float64(totalPods) / (float64(totalTrafficPercent) / 100.0)`) can exceed Scale.Max when the old revision was HPA-scaled up during a traffic spike. This would flow uncapped into `divideReplicas`, over-allocating pods.

Agree this must be addressed before merge.

Concern 2 — Annotation Overwrite: CORRECT, MEDIUM priority

The new rs.Annotations = typed.Annotations line in plan/common.go (the diff shows it added after the replicas logic block) replaces the entire annotation map. Picchu owns these ReplicaSets directly (not via Deployments), so in practice it's unlikely other controllers add annotations — but it's still a correctness issue and fragile.

However, there's an important entanglement your colleague didn't flag (see gaps below).

Concern 3 — Stale Status: CORRECT but inherent to the reconciliation model, LOW priority

This is a fundamental property of all controller-runtime operators. Status always lags by one observe cycle. The PR correctly mirrors CanRampTo's logic, so they'll be stale in the same way. Self-corrects in 1-2 cycles. The logging recommendation is sensible but not blocking.

Concern 4 — Scale.Default Behavior Change: CORRECT, intentional, LOW priority

The defaulting change at api/v1alpha1/source_defaults.go:109-113 is the intended fix for stuck ramps. The colleague correctly identifies that first canaries will start with more replicas. This is the point. Documenting it in release notes is the right call.

Concern 5 — HPA Recreation Race: CORRECT analysis, self-mitigating

Your colleague actually resolves their own concern in the second paragraph. I confirmed: plan.All at plan/composite.go:25-39 executes sequentially (SyncRevision first, then ScaleRevision). When transitioning to released, SyncRevision removes the ramping annotation and ScaleRevision creates the HPA — both in the same reconcile. Between these two calls, the existing replica count is preserved by the common.go logic (when ramping=false, replicas only update to/from zero). No window of uncontrolled replicas.

Concern 6 — genScalePlan Condition Ordering: CORRECT, LOW priority

A released revision at CurrentPercent == 0 hits the min = max branch (pinned HPA). It never reaches the new released branch. This is correct behavior: a released revision at 0% traffic is being retired, and pinning prevents scaling oscillation during wind-down.

Concern 7 — Scale.Min=0 Fallback: OVERSTATED

The colleague says first canary "could get 100 replicas" but misses that divideReplicas scales down by traffic percentage and cluster count. With Scale.Max=100, canary at 5%, 4 clusters: 100 * 5/100 / 4 = ~2 replicas per cluster. That's reasonable. The concern is directionally valid (more replicas than Scale.Default would give), but the actual impact is modest.

Concern 8 — Multi-Cluster Heterogeneous Scaling: CORRECT but pre-existing

This is not introduced by the PR. divideReplicas has always divided evenly by cluster count. The PR's getBaseCapacityForRamp mirrors CanRampTo and is consistent with the existing pattern.


Part 2: Gaps the Colleague Missed

Gap A (HIGH) — Annotation overwrite and ramping cleanup are entangled

Concerns #2 and the ramping annotation lifecycle are coupled. Right now, the rs.Annotations = typed.Annotations overwrite is what removes the ramping annotation when Ramping=false — the annotation simply isn't in the new map, so it gets wiped.

If you fix concern #2 (merge annotations instead of replace), you must also explicitly delete AnnotationRamping when Ramping=false. Otherwise a previously-ramping ReplicaSet keeps the annotation forever, and common.go will keep force-updating replicas even after the HPA is recreated. This could fight the HPA.

Gap B (MEDIUM) — Failure state transitions while autoscaler is deleted

During canarying, the HPA/KEDA/WPA is deleted. If the revision then fails (SLO alert, manual rollback) and transitions to failing:

  • Ramping becomes false (not canarying/releasing)
  • ScaleRevision will attempt to recreate the HPA on the next reconcile
  • But there's a window (one reconcile cycle) where no autoscaler exists AND no ramping control is active

The replicas themselves are preserved (common.go's to/from-zero logic), so this isn't a capacity emergency. But it means the revision is running with a fixed replica count until the HPA is recreated and stabilizes. For a revision that just failed its canary, this is likely fine (it's about to lose traffic), but it's worth considering whether the failure path should proactively clean up.

Gap C (MEDIUM) — Duplicate Ramping computation, no single source of truth

Both SyncRevision.Ramping and ScaleRevision.Ramping are independently computed from the same expression in sync():

```go
Ramping: i.target().Scale.HasAutoscaler() &&
    (i.status.State.Current == "canarying" || i.status.State.Current == "releasing")
```

This appears on lines ~463 and ~585 of the PR diff. If these ever diverge due to a future code change, SyncRevision could set the ramping annotation while ScaleRevision doesn't delete the HPA (or vice versa). A single computed isRamping variable passed to both would be safer.

Gap D (LOW) — No test for the failure-during-ramp path

The PR has good test coverage for the happy paths (ramping states, released states, WPA). But there's no test for what happens when a revision in canarying (Ramping=true, HPA deleted) transitions to failing. Does the next reconcile correctly recreate the HPA and stop ramping?


Summary / Recommendations

Must fix before merge:

  1. Add the Scale.Max cap described in PR Fix 4b (controllers/incarnation.go)
  2. Be deliberate about annotation handling: if you switch to merge-based annotation updates, also explicitly remove AnnotationRamping when Ramping=false

Should fix or explicitly document:

  3. Extract a single isRamping bool in sync() and pass it to both plans
  4. Add a test for the canarying → failing transition to confirm HPA recreation
  5. Document the Scale.Default behavior change in release notes

Acceptable risks (no action needed):

  • Stale status lag (inherent to controller-runtime)
  • Even cluster division (pre-existing behavior)
  • Scale.Min=0 fallback (mitigated by divideReplicas)
