HPA scales up stable replicas when doing canary deployment with Argo Rollouts
We use canary deployments through Argo Rollouts to deploy our services. For services that use the Kubernetes Horizontal Pod Autoscaler (HPA) with CPU-based scaling, we observe the stable ReplicaSet scaling up during each deployment and then scaling back down after the deployment completes.
However, when reviewing the service's metrics with both `kubectl describe hpa` and `kubectl get hpa` during these scale-ups, the reported values never exceed the configured threshold.
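For context on how this is wired together: the scaling target is the Rollout itself rather than a Deployment. In plain HPA terms (the HPA here is actually generated by KEDA, as the events below show), the setup looks roughly like this minimal sketch; the names and numbers mirror the describe output that follows, the rest is illustrative:

```yaml
# Minimal sketch of a CPU-based HPA targeting a Rollout; not the exact manifest
# in use here (our HPA is generated by KEDA from a ScaledObject, see below).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: abc-app
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1   # scales the Rollout, not a Deployment
    kind: Rollout
    name: abc-app
  minReplicas: 40
  maxReplicas: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60       # matches the 60% target shown below
```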
```
Reference:                 Rollout/abc-app
Target CPU utilization:    60%
Current CPU utilization:   4%
Min replicas:              40
Max replicas:              120
Rollout pods:              94 current / 94 desired
Events:
Type Reason Age From Message
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 86; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 87; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 88; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 89; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 90; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 91; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 37m (x73 over 4d23h) horizontal-pod-autoscaler (combined from similar events): New size: 91; reason: All metrics below target
Normal SuccessfulRescale 13m (x6 over 42m) horizontal-pod-autoscaler New size: 94; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m41s (x8 over 23h) horizontal-pod-autoscaler New size: 92; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m26s (x8 over 23h) horizontal-pod-autoscaler New size: 93; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 3m9s (x6 over 32m) horizontal-pod-autoscaler New size: 91; reason: All metrics below target
```
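Worth noting: the HPA above is managed by KEDA (the events reference `scaledobject.keda.sh/name: abc-app`), and the rescales are attributed to an external cron metric rather than CPU. We don't have the exact ScaledObject here, but a trigger combination along these lines would produce an HPA like the one above plus the external metric the `s3-cron-Asia-Kolkata-…` events refer to; the schedule and replica values below are placeholders, since the real ones are redacted in the events:

```yaml
# Sketch only: a ScaledObject that would generate an HPA like the one above.
# The cron schedule and desiredReplicas are placeholders; the real values are
# redacted ("0016xxx-0000xxx") in the events.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: abc-app
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: abc-app
  minReplicaCount: 40
  maxReplicaCount: 120
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "60"
    - type: cron                   # exposed to the HPA as an external metric
      metadata:
        timezone: Asia/Kolkata
        start: "0 16 * * *"        # placeholder
        end: "0 0 * * *"           # placeholder
        desiredReplicas: "90"      # placeholder
```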
KEDA version - 2.13.2
Argo Rollouts - v1.2.1
Kubernetes - 1.28
FWIW we've observed this in our infrastructure as well. In this instance, a service was configured with Argo Rollouts using three canary steps: 1%, 10%, and 65% of traffic. When traffic was increased from 10% to 65%, we saw a huge spike in both the canary and the stable ReplicaSets.
Here are the steps as they appear on the Rollouts dashboard:
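(The dashboard screenshot doesn't carry over here; a three-step canary like the one described is typically declared along these lines — a sketch assuming plain `setWeight` steps with pauses, not the actual spec, and with a hypothetical service name.)

```yaml
# Sketch of a three-step canary (1% -> 10% -> 65%); pause durations and any
# analysis steps in the real Rollout spec may differ.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service          # hypothetical; the real name is redacted
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {}          # wait for promotion
        - setWeight: 10
        - pause: {}
        - setWeight: 65
        - pause: {}
```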
At this point the stable ReplicaSet scaled up to 42 replicas, the maximum allowed by the HPA, while the canary scaled up to 28 (65% of 42, rounded up; I assume Argo Rollouts derives the canary count from the traffic percentage applied to either the current or the total replica count).
Inspecting the HPA, we can see that both CPU and memory (it scales on both) are below their targets. It stayed like that for hours, throughout the whole canary, and never scaled down:
```
Reference:     Rollout/<redacted>
Metrics:       ( current / target )
  resource cpu of container "<redacted>" on pods (as a percentage of request):     4% (41m) / 100%
  resource memory of container "<redacted>" on pods (as a percentage of request):  42% (1834665691428m) / 60%
Min replicas:  10
Max replicas:  42
Rollout pods:  42 current / 42 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from memory container resource utilization (percentage of request)
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
Events:          <none>
```
This is how it looks in Grafana: at about 13:30, when the traffic shift went from 10% to 65%, we saw the huge spike in replicas, both stable and canary, going from a total of 22 to 70.
ArgoRollouts v1.7.1
ArgoCD v2.9.1+58b04e5
Kubernetes v1.29.4