
Have picchu update the replicaset during deployment rampup #602

Open

ddbenson wants to merge 1 commit into main from ddbenson/ramp-via-replicaset

Conversation

ddbenson (Contributor) commented on Feb 26, 2026

Ramp via ReplicaSet

Summary

Fixes production scale-down bug, stuck ramps, and multi-cluster ramp allocation when using autoscalers (HPA/KEDA/WPA).

Core change: During canarying/releasing, Picchu disables the autoscaler and controls ReplicaSet replicas directly. Released revisions keep their autoscaler.


Fixes

1. Production Bug: Released Revisions Losing HPA

Problem: Picchu set Ramping=true for released revisions, deleted their HPAs, and scaled them to divideReplicas(Scale.Default) (often 1). Result: 48 pods → 4.

Fix: Ramping only for canarying/releasing:

```go
Ramping: i.target().Scale.HasAutoscaler() &&
    (i.status.State.Current == "canarying" || i.status.State.Current == "releasing")
```

2. Scale.Default Defaulting

Problem: Apps with scale: { min: 16, max: 96 } and no default got Scale.Default=1 → ramp stuck at 1–4 pods.

Fix: When Scale.Default==0 and Scale.Min set, default to *Scale.Min.

3. BaseCapacity for Multi-Cluster Ramp

Problem: Old revision at 120 pods (HPA-scaled). CanRampTo needs 12 replicas for 10% traffic. We allocated from Scale.Default (16) → only 4 replicas. Ramp stuck.

Fix: When ramping, use baseCapacity (from other revisions' pod count) instead of Scale.Default for replica allocation. Matches CanRampTo logic.

4. New Revision Under-Allocated (Core Issue)

Problem: New revision at 85% traffic had only 144 pods. We used baseCapacity from other revisions only (163) → allocated 163 max. Total system capacity was 307 (163+144). For 85% we need ~261 pods.

Fix: Use getBaseCapacityForRamp() only (other revisions' pods). Do NOT include our own pods—that caused over-allocation (e.g. tutu: 450+383=833 → new gets 752 → total 820).

4b. Over-Allocation Feedback Loop

Problem: Including our pods in total created a feedback loop: our pods → higher total → more allocation → more pods. ramptest3 grew to 232 for a service that should have 100.

Fix: Cap total at expectedTotalReplicas(Scale.Max, 100). So total = min(base + our_current, Scale.Max).

5. Released Revisions Keep Full HPA Range

Problem: Old revision at 15% traffic kept 163 pods. Preemptive pinning was considered but rejected.

Fix: Released revisions keep full Scale.Min to Scale.Max. HPA downscales based on observed load (lower CPU from less traffic), not preemptive pinning.

6. Jubilee / WPA Ramp Updates

Problem: Jubilee workers use WPA. plan/common.go skipped replica updates for WPA targets entirely, so ramping ReplicaSets stayed at WPA-set count (e.g. 2) → ramp stuck.

Fix: When ramping, update ReplicaSet replicas even for WPA (we delete WPA during ramp).

7. Scale.Min=0 Fallback

Problem: Jubilee workers with scale.min: 0 got getBaseCapacityForRamp()=0 when no other revisions → fell back to Scale.Default=1 → 2 replicas, ramp stuck.

Fix: When Scale.Min=0 and no other revisions, use Scale.Max as baseCapacity.


How It Works

| State | Ramping | Replicas controlled by | Autoscaler |
| --- | --- | --- | --- |
| canarying | true | Picchu (`divideReplicas(baseCapacity)`) | Deleted |
| releasing | true | Picchu (`divideReplicas(baseCapacity)`) | Deleted |
| released | false | HPA/KEDA/WPA (full Scale.Min–Scale.Max) | Present |
| deploying/deployed/pendingrelease | false | HPA (or 1 for deploying) | Present |

Ramping mechanics:

  1. ScaleRevision deletes HPA/KEDA/WPA when Ramping=true
  2. SyncRevision sets ReplicaSet replicas and adds picchu.medium.engineering/ramping: "true"
  3. plan/common.go updates ReplicaSet replicas when annotation present (otherwise skips to avoid fighting HPA)

baseCapacity: Mirrors CanRampTo—uses other revisions' pod count (or Scale.Min if none). Ensures we allocate enough replicas to pass the ramp check.

Multiple ramping revisions: Each has its own ReplicaSet; no conflicts. Only newest ramping revision gets traffic incremented.


Files Changed

| File | Change |
| --- | --- |
| api/v1alpha1/common.go | AnnotationRamping |
| api/v1alpha1/source_defaults.go | Default Scale.Default to Scale.Min when unset |
| api/v1alpha1/source_defaults_test.go | New SetScaleDefaults tests |
| controllers/incarnation.go | Ramping scope; getBaseCapacityForRamp; use baseCapacity when ramping |
| controllers/plan/syncRevision.go | Ramping field; set annotation |
| controllers/plan/scaleRevision.go | Ramping field; delete autoscalers when true |
| plan/common.go | ReplicaSet updates when ramping annotation present |

Tests

| Test | Purpose |
| --- | --- |
| TestGenScalePlanRampingByState | Ramping only for canarying/releasing |
| TestScaleRevisionRampingDeletesAutoscalers | Ramping deletes HPA/KEDA/WPA |
| TestScaleRevisionNotRampingCreatesHPA | Not ramping creates HPA |
| TestSyncRevisionRampingAnnotation | Ramping sets annotation |
| TestSyncRevisionRampingUpdatesReplicas | Ramping allows replica updates |
| TestSyncRevisionRampingUpdatesWPAReplicas | Ramping updates WPA ReplicaSet replicas |
| TestSyncRevisionNotRampingPreservesReplicas | Not ramping preserves replicas |
| TestSetScaleDefaults | Scale.Default defaulting |
| TestGetBaseCapacityForRamp | baseCapacity from other revisions (excludes our pods) |
| TestGenScalePlanReleasedKeepsHPA | Released keeps full Scale.Min/Max; HPA downscales on load |

Rollout

  • Backward compatible: Apps with explicit scale.default unchanged.
  • Behavior change: Apps with min/max but no default use Scale.Min as ramp baseline.
  • Production fix: Released revisions no longer lose HPA during deployments.

ddbenson force-pushed the ddbenson/ramp-via-replicaset branch from 763b006 to 7051bb0 on February 26, 2026 22:10
eddiebarth1 (Contributor) commented:

I had Cursor and Claude look this PR over, and it does look like there are a couple of things we should consider.
Cursor feedback is here:
pr_602_ramp_concerns_821839f8.plan.md

Claude agreed with those concerns, along with a couple others:

Part 1: Accuracy of Colleague's Feedback

Concern 1 — Missing Fix 4b (Over-Allocation Cap): CORRECT, HIGH priority

Your colleague is right. The PR description promises a cap at min(base + our_current, Scale.Max), but `getBaseCapacityForRamp()` at controllers/incarnation.go:296-341 has no upper bound. The normalized capacity calculation at line 338 (`normalizedCapacity := float64(totalPods) / (float64(totalTrafficPercent) / 100.0)`) can exceed Scale.Max when the old revision was HPA-scaled up during a traffic spike. This would flow uncapped into `divideReplicas`, over-allocating pods.

Agree this must be addressed before merge.

Concern 2 — Annotation Overwrite: CORRECT, MEDIUM priority

The new rs.Annotations = typed.Annotations line in plan/common.go (the diff shows it added after the replicas logic block) replaces the entire annotation map. Picchu owns these ReplicaSets directly (not via Deployments), so in practice it's unlikely other controllers add annotations — but it's still a correctness issue and fragile.

However, there's an important entanglement your colleague didn't flag (see gaps below).

Concern 3 — Stale Status: CORRECT but inherent to the reconciliation model, LOW priority

This is a fundamental property of all controller-runtime operators. Status always lags by one observe cycle. The PR correctly mirrors CanRampTo's logic, so they'll be stale in the same way. Self-corrects in 1-2 cycles. The logging recommendation is sensible but not blocking.

Concern 4 — Scale.Default Behavior Change: CORRECT, intentional, LOW priority

The defaulting change at api/v1alpha1/source_defaults.go:109-113 is the intended fix for stuck ramps. The colleague correctly identifies that first canaries will start with more replicas. This is the point. Documenting it in release notes is the right call.

Concern 5 — HPA Recreation Race: CORRECT analysis, self-mitigating

Your colleague actually resolves their own concern in the second paragraph. I confirmed: plan.All at plan/composite.go:25-39 executes sequentially (SyncRevision first, then ScaleRevision). When transitioning to released, SyncRevision removes the ramping annotation and ScaleRevision creates the HPA — both in the same reconcile. Between these two calls, the existing replica count is preserved by the common.go logic (when ramping=false, replicas only update to/from zero). No window of uncontrolled replicas.

Concern 6 — genScalePlan Condition Ordering: CORRECT, LOW priority

A released revision at CurrentPercent == 0 hits the min = max branch (pinned HPA). It never reaches the new released branch. This is correct behavior: a released revision at 0% traffic is being retired, and pinning prevents scaling oscillation during wind-down.

Concern 7 — Scale.Min=0 Fallback: OVERSTATED

The colleague says first canary "could get 100 replicas" but misses that divideReplicas scales down by traffic percentage and cluster count. With Scale.Max=100, canary at 5%, 4 clusters: 100 * 5/100 / 4 = ~2 replicas per cluster. That's reasonable. The concern is directionally valid (more replicas than Scale.Default would give), but the actual impact is modest.

Concern 8 — Multi-Cluster Heterogeneous Scaling: CORRECT but pre-existing

This is not introduced by the PR. divideReplicas has always divided evenly by cluster count. The PR's getBaseCapacityForRamp mirrors CanRampTo and is consistent with the existing pattern.


Part 2: Gaps the Colleague Missed

Gap A (HIGH) — Annotation overwrite and ramping cleanup are entangled

Concerns #2 and the ramping annotation lifecycle are coupled. Right now, the rs.Annotations = typed.Annotations overwrite is what removes the ramping annotation when Ramping=false — the annotation simply isn't in the new map, so it gets wiped.

If you fix concern #2 (merge annotations instead of replace), you must also explicitly delete AnnotationRamping when Ramping=false. Otherwise a previously-ramping ReplicaSet keeps the annotation forever, and common.go will keep force-updating replicas even after the HPA is recreated. This could fight the HPA.

Gap B (MEDIUM) — Failure state transitions while autoscaler is deleted

During canarying, the HPA/KEDA/WPA is deleted. If the revision then fails (SLO alert, manual rollback) and transitions to failing:

  • Ramping becomes false (not canarying/releasing)
  • ScaleRevision will attempt to recreate the HPA on the next reconcile
  • But there's a window (one reconcile cycle) where no autoscaler exists AND no ramping control is active

The replicas themselves are preserved (common.go's to/from-zero logic), so this isn't a capacity emergency. But it means the revision is running with a fixed replica count until the HPA is recreated and stabilizes. For a revision that just failed its canary, this is likely fine (it's about to lose traffic), but it's worth considering whether the failure path should proactively clean up.

Gap C (MEDIUM) — Duplicate Ramping computation, no single source of truth

Both SyncRevision.Ramping and ScaleRevision.Ramping are independently computed from the same expression in sync():

```go
Ramping: i.target().Scale.HasAutoscaler() &&
    (i.status.State.Current == "canarying" || i.status.State.Current == "releasing")
```

This appears on lines ~463 and ~585 of the PR diff. If these ever diverge due to a future code change, SyncRevision could set the ramping annotation while ScaleRevision doesn't delete the HPA (or vice versa). A single computed isRamping variable passed to both would be safer.

Gap D (LOW) — No test for the failure-during-ramp path

The PR has good test coverage for the happy paths (ramping states, released states, WPA). But there's no test for what happens when a revision in canarying (Ramping=true, HPA deleted) transitions to failing. Does the next reconcile correctly recreate the HPA and stop ramping?


Summary / Recommendations

Must fix before merge:

  1. Add the Scale.Max cap described in PR Fix 4b (controllers/incarnation.go)
  2. Be deliberate about annotation handling: if you switch to merge-based annotation updates, also explicitly remove AnnotationRamping when Ramping=false

Should fix or explicitly document:

  3. Extract a single isRamping bool in sync() and pass it to both plans
  4. Add a test for the canarying → failing transition to confirm HPA recreation
  5. Document the Scale.Default behavior change in release notes

Acceptable risks (no action needed):

  • Stale status lag (inherent to controller-runtime)
  • Even cluster division (pre-existing behavior)
  • Scale.Min=0 fallback (mitigated by divideReplicas)
