Have picchu update the replicaset during deployment rampup#602
I had Cursor & Claude look this PR over, and there are a couple of things we should consider. Claude agreed with those concerns, along with a couple of others:

Part 1: Accuracy of Colleague's Feedback

Concern 1 — Missing Fix 4b (Over-Allocation Cap): CORRECT, HIGH priority
Your colleague is right. The PR description promises a cap at `expectedTotalReplicas(Scale.Max, 100)` […]. Agree this must be addressed before merge.

Concern 2 — Annotation Overwrite: CORRECT, MEDIUM priority
The new […]. However, there's an important entanglement your colleague didn't flag (see gaps below).

Concern 3 — Stale Status: CORRECT but inherent to the reconciliation model, LOW priority
This is a fundamental property of all controller-runtime operators. Status always lags by one observe cycle. The PR correctly mirrors […].

Concern 4 — Scale.Default Behavior Change: CORRECT, intentional, LOW priority
The defaulting change at […].

Concern 5 — HPA Recreation Race: CORRECT analysis, self-mitigating
Your colleague actually resolves their own concern in the second paragraph. I confirmed: […].

Concern 6 — genScalePlan Condition Ordering: CORRECT, LOW priority
A released revision at […].

Concern 7 — Scale.Min=0 Fallback: OVERSTATED
The colleague says the first canary "could get 100 replicas" but misses that […].

Concern 8 — Multi-Cluster Heterogeneous Scaling: CORRECT but pre-existing
This is not introduced by the PR.

Part 2: Gaps the Colleague Missed

Gap A (HIGH) — Annotation overwrite and ramping cleanup are entangled
Concern #2 and the ramping annotation lifecycle are coupled. Right now, the […]. If you fix concern #2 (merge annotations instead of replace), you must also explicitly delete […].

Gap B (MEDIUM) — Failure state transitions while the autoscaler is deleted
During […]. The replicas themselves are preserved (common.go's to/from-zero logic), so this isn't a capacity emergency. But it means the revision is running with a fixed replica count until the HPA is recreated and stabilizes. For a revision that just failed its canary, this is likely fine (it's about to lose traffic), but it's worth considering whether the failure path should proactively clean up.

Gap C (MEDIUM) — Duplicate Ramping computation, no single source of truth
Both SyncRevision and ScaleRevision compute:

`Ramping: i.target().Scale.HasAutoscaler() && (i.status.State.Current == "canarying" || i.status.State.Current == "releasing")`

This appears on lines ~463 and ~585 of the PR diff. If these ever diverge due to a future code change, SyncRevision could set the ramping annotation while ScaleRevision doesn't delete the HPA (or vice versa). A single computed […].

Gap D (LOW) — No test for the failure-during-ramp path
The PR has good test coverage for the happy paths (ramping states, released states, WPA). But there's no test for what happens when a revision in […].

Summary / Recommendations

Must fix before merge: […]

Should fix or explicitly document: […]

Acceptable risks (no action needed): […]
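Gap C above asks for a single source of truth for the ramping decision. A minimal sketch of what such a helper could look like, written as a free function over simplified inputs rather than picchu's real `Incarnation` method (the names here are hypothetical; only the condition itself comes from the PR diff):

```go
package main

import "fmt"

// isRamping is a single shared predicate for the "ramping" decision that
// Gap C wants SyncRevision (annotation) and ScaleRevision (autoscaler
// deletion) to agree on. It mirrors the duplicated expression from the
// diff: the target has an autoscaler AND the revision is in a ramping state.
func isRamping(hasAutoscaler bool, state string) bool {
	return hasAutoscaler && (state == "canarying" || state == "releasing")
}

func main() {
	for _, state := range []string{"canarying", "releasing", "released", "failing"} {
		fmt.Printf("state=%-10s ramping=%v\n", state, isRamping(true, state))
	}
	// A target without an autoscaler never ramps via ReplicaSet control.
	fmt.Println("no autoscaler, canarying:", isRamping(false, "canarying"))
}
```

If both call sites consume one helper like this, the annotation and the autoscaler deletion cannot drift apart under future edits.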
Ramp via ReplicaSet
Summary
Fixes a production scale-down bug, stuck ramps, and broken multi-cluster ramp allocation when using autoscalers (HPA/KEDA/WPA).
Core change: During canarying/releasing, Picchu disables the autoscaler and controls ReplicaSet replicas directly. Released revisions keep their autoscaler.
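The replica arithmetic behind "controls ReplicaSet replicas directly" is easiest to see with numbers. A sketch, assuming a hypothetical `rampReplicas` helper (not picchu's actual `divideReplicas` signature) that allocates a traffic percentage of the observed base capacity, rounded up:

```go
package main

import (
	"fmt"
	"math"
)

// rampReplicas returns how many replicas a ramping revision needs in
// order to serve `percent` of the traffic, given the base capacity
// observed from other revisions' pods. Rounding up ensures the revision
// can pass the ramp check rather than falling one pod short.
func rampReplicas(percent uint32, baseCapacity int32) int32 {
	return int32(math.Ceil(float64(baseCapacity) * float64(percent) / 100.0))
}

func main() {
	// The PR's Fix 3 example: old revision running 120 pods (HPA-scaled),
	// new revision ramping to 10% traffic -> needs 12 replicas, not the
	// 4 that allocating from Scale.Default produced.
	fmt.Println(rampReplicas(10, 120)) // 12

	// The Fix 4 example: total system capacity 307, 85% traffic.
	fmt.Println(rampReplicas(85, 307)) // 261
}
```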
Fixes
1. Production Bug: Released Revisions Losing HPA
Problem: Picchu set `Ramping=true` for released revisions, deleted their HPAs, and scaled them to `divideReplicas(Scale.Default)` (often 1). Result: 48 pods → 4.

Fix: Set `Ramping` only for canarying/releasing revisions.

2. Scale.Default Defaulting
Problem: Apps with `scale: { min: 16, max: 96 }` and no `default` got `Scale.Default=1` → ramp stuck at 1–4 pods.

Fix: When `Scale.Default==0` and `Scale.Min` is set, default to `*Scale.Min`.

3. BaseCapacity for Multi-Cluster Ramp
Problem: Old revision at 120 pods (HPA-scaled). `CanRampTo` needs 12 replicas for 10% traffic. We allocated from `Scale.Default` (16) → only 4 replicas. Ramp stuck.

Fix: When ramping, use baseCapacity (derived from other revisions' pod count) instead of `Scale.Default` for replica allocation. Matches `CanRampTo` logic.

4. New Revision Under-Allocated (Core Issue)
Problem: New revision at 85% traffic had only 144 pods. We used baseCapacity from other revisions only (163) → allocated 163 max. Total system capacity was 307 (163+144). For 85% we need ~261 pods.
Fix: Use `getBaseCapacityForRamp()` only (other revisions' pods). Do NOT include our own pods—that caused over-allocation (e.g. tutu: 450+383=833 → new gets 752 → total 820).

4b. Over-Allocation Feedback Loop
Problem: Including our pods in total created a feedback loop: our pods → higher total → more allocation → more pods. ramptest3 grew to 232 for a service that should have 100.
Fix: Cap the total at `expectedTotalReplicas(Scale.Max, 100)`. So total = min(base + our_current, Scale.Max).

5. Released Revisions Keep Full HPA Range
Problem: Old revision at 15% traffic kept 163 pods. Preemptive pinning was considered but rejected.
Fix: Released revisions keep the full Scale.Min–Scale.Max range. The HPA downscales based on observed load (lower CPU from less traffic), not preemptive pinning.
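Fixes 4 and 4b above combine into one allocation rule: base capacity from other revisions, plus our own current pods, capped at Scale.Max. A sketch with a hypothetical `totalCapacityForRamp` helper (the cap value in the second call is illustrative, not from the PR):

```go
package main

import "fmt"

// totalCapacityForRamp computes the capacity a ramping revision's traffic
// percentage is taken against: other revisions' pods (base) plus our own
// current pods, capped at Scale.Max so that our own growth cannot feed
// back into ever-larger allocations (the Fix 4b loop).
func totalCapacityForRamp(base, ourCurrent, scaleMax int32) int32 {
	total := base + ourCurrent
	if total > scaleMax {
		total = scaleMax
	}
	return total
}

func main() {
	// The Fix 4 example: other revisions at 163 pods, our revision at 144,
	// comfortably under the cap -> total system capacity 307.
	fmt.Println(totalCapacityForRamp(163, 144, 400)) // 307

	// With a tighter (illustrative) Scale.Max, the cap bounds the total
	// instead of letting our own pods inflate it.
	fmt.Println(totalCapacityForRamp(163, 144, 200)) // 200
}
```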
6. Jubilee / WPA Ramp Updates
Problem: Jubilee workers use WPA. `plan/common.go` skipped replica updates for WPA targets entirely, so ramping ReplicaSets stayed at the WPA-set count (e.g. 2) → ramp stuck.

Fix: When ramping, update ReplicaSet replicas even for WPA (we delete the WPA during ramp).
7. Scale.Min=0 Fallback
Problem: Jubilee workers with
scale.min: 0gotgetBaseCapacityForRamp()=0when no other revisions → fell back toScale.Default=1→ 2 replicas, ramp stuck.Fix: When
Scale.Min=0and no other revisions, useScale.Maxas baseCapacity.How It Works
Canarying and releasing revisions are scaled to `divideReplicas(baseCapacity)`.

Ramping mechanics:
- `ScaleRevision` deletes HPA/KEDA/WPA when `Ramping=true`
- `SyncRevision` sets ReplicaSet replicas and adds `picchu.medium.engineering/ramping: "true"`
- `plan/common.go` updates ReplicaSet replicas when the annotation is present (otherwise it skips, to avoid fighting the HPA)

baseCapacity: Mirrors `CanRampTo`—uses other revisions' pod count (or `Scale.Min` if none). Ensures we allocate enough replicas to pass the ramp check.

Multiple ramping revisions: Each has its own ReplicaSet; no conflicts. Only the newest ramping revision gets its traffic incremented.
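The annotation gate in the mechanics above can be sketched as a small predicate. `shouldUpdateReplicas` is a hypothetical name; only the annotation key comes from the PR:

```go
package main

import "fmt"

// The annotation SyncRevision sets on a ramping ReplicaSet (from the PR).
const annotationRamping = "picchu.medium.engineering/ramping"

// shouldUpdateReplicas decides whether plan/common.go may overwrite a
// ReplicaSet's replica count. When the ramping annotation is present,
// Picchu owns the count (the autoscaler has been deleted); otherwise it
// leaves replicas alone so it doesn't fight the HPA.
func shouldUpdateReplicas(annotations map[string]string) bool {
	return annotations[annotationRamping] == "true"
}

func main() {
	ramping := map[string]string{annotationRamping: "true"}
	released := map[string]string{} // no annotation: HPA drives replicas
	fmt.Println(shouldUpdateReplicas(ramping))  // true
	fmt.Println(shouldUpdateReplicas(released)) // false
}
```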
Files Changed
- api/v1alpha1/common.go: `AnnotationRamping`
- api/v1alpha1/source_defaults.go: default `Scale.Default` to `Scale.Min` when unset
- api/v1alpha1/source_defaults_test.go: `SetScaleDefaults` tests
- controllers/incarnation.go: `Ramping` scope; `getBaseCapacityForRamp`; use baseCapacity when ramping
- controllers/plan/syncRevision.go: `Ramping` field; set annotation
- controllers/plan/scaleRevision.go: `Ramping` field; delete autoscalers when true
- plan/common.go: update ReplicaSet replicas when the ramping annotation is present

Tests
- TestGenScalePlanRampingByState
- TestScaleRevisionRampingDeletesAutoscalers
- TestScaleRevisionNotRampingCreatesHPA
- TestSyncRevisionRampingAnnotation
- TestSyncRevisionRampingUpdatesReplicas
- TestSyncRevisionRampingUpdatesWPAReplicas
- TestSyncRevisionNotRampingPreservesReplicas
- TestSetScaleDefaults
- TestGetBaseCapacityForRamp
- TestGenScalePlanReleasedKeepsHPA

Rollout
Apps with `scale.default` set: unchanged. Apps with `min`/`max` but no `default`: use `Scale.Min` as the ramp baseline.
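The rollout behavior above follows from the defaulting and fallback rules in Fixes 2 and 7. A combined sketch under hypothetical names (`defaultScale`, `baseCapacityFallback`), not picchu's real signatures:

```go
package main

import "fmt"

// defaultScale applies Fix 2: when Scale.Default is unset (0) but
// Scale.Min is set, default to Scale.Min so ramps don't start from 1.
func defaultScale(def, min int32) int32 {
	if def == 0 && min > 0 {
		return min
	}
	return def
}

// baseCapacityFallback applies Fix 7: prefer other revisions' observed
// pod count; with no other revisions and Scale.Min=0 (Jubilee-style
// workers), fall back to Scale.Max instead of Scale.Default=1.
func baseCapacityFallback(otherRevisionPods, scaleMin, scaleMax int32) int32 {
	if otherRevisionPods > 0 {
		return otherRevisionPods
	}
	if scaleMin > 0 {
		return scaleMin
	}
	return scaleMax
}

func main() {
	fmt.Println(defaultScale(0, 16))               // 16: min/max-only app ramps from Scale.Min
	fmt.Println(defaultScale(8, 16))               // 8: explicit scale.default unchanged
	fmt.Println(baseCapacityFallback(0, 0, 10))    // 10: min=0 worker, no other revisions
	fmt.Println(baseCapacityFallback(120, 16, 96)) // 120: other revisions' pods win
}
```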