HPA scales up stable replicas when doing canary deployment with Argo Rollouts
We use canary deployments through Argo Rollouts to deploy our services. For services that use the Kubernetes Horizontal Pod Autoscaler (HPA) with CPU-based scaling, we observe the stable ReplicaSet scaling up during each deployment and then scaling back down after the deployment completes.
However, when reviewing the service's metrics with both `kubectl describe hpa` and `kubectl get hpa` during these scale-ups, the reported values never exceed the configured threshold.
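For context on how this is wired together: the scaling target is the Rollout itself rather than a Deployment. In plain HPA terms (the HPA here is actually generated by KEDA, as the events below show), the setup looks roughly like this minimal sketch; the names and numbers mirror the describe output that follows, the rest is illustrative:

```yaml
# Minimal sketch of a CPU-based HPA targeting a Rollout; not the exact manifest
# in use here (our HPA is generated by KEDA from a ScaledObject, see below).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: abc-app
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1   # scales the Rollout, not a Deployment
    kind: Rollout
    name: abc-app
  minReplicas: 40
  maxReplicas: 120
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60       # matches the 60% target shown below
```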
```
Reference:                 Rollout/abc-app
Target CPU utilization:    60%
Current CPU utilization:   4%
Min replicas:              40
Max replicas:              120
Rollout pods:              94 current / 94 desired
Events:
Type Reason Age From Message
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 86; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 87; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 88; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 44m (x2 over 23h) horizontal-pod-autoscaler New size: 89; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 90; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 43m horizontal-pod-autoscaler New size: 91; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 37m (x73 over 4d23h) horizontal-pod-autoscaler (combined from similar events): New size: 91; reason: All metrics below target
Normal SuccessfulRescale 13m (x6 over 42m) horizontal-pod-autoscaler New size: 94; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m41s (x8 over 23h) horizontal-pod-autoscaler New size: 92; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 8m26s (x8 over 23h) horizontal-pod-autoscaler New size: 93; reason: external metric s3-cron-Asia-Kolkata-0016xxx-0000xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: abc-app,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
Normal SuccessfulRescale 3m9s (x6 over 32m) horizontal-pod-autoscaler New size: 91; reason: All metrics below target
```
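Worth noting: the HPA above is managed by KEDA (the events reference `scaledobject.keda.sh/name: abc-app`), and the rescales are attributed to an external cron metric rather than CPU. We don't have the exact ScaledObject here, but a trigger combination along these lines would produce an HPA like the one above plus the external metric the `s3-cron-Asia-Kolkata-…` events refer to; the schedule and replica values below are placeholders, since the real ones are redacted in the events:

```yaml
# Sketch only: a ScaledObject that would generate an HPA like the one above.
# The cron schedule and desiredReplicas are placeholders; the real values are
# redacted ("0016xxx-0000xxx") in the events.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: abc-app
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: abc-app
  minReplicaCount: 40
  maxReplicaCount: 120
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "60"
    - type: cron                   # exposed to the HPA as an external metric
      metadata:
        timezone: Asia/Kolkata
        start: "0 16 * * *"        # placeholder
        end: "0 0 * * *"           # placeholder
        desiredReplicas: "90"      # placeholder
```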
KEDA version - 2.13.2
Argo Rollouts - v1.2.1
Kubernetes - 1.28
FWIW we've observed this in our infrastructure as well. In this instance, a service was configured with Argo Rollouts using three canary steps: 1%, 10%, and 65% of traffic. When traffic was increased from 10% to 65%, we saw a huge spike in both the canary and the stable ReplicaSets.
Here are the steps as they appear on the Rollouts dashboard:
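(The dashboard screenshot doesn't carry over here; a three-step canary like the one described is typically declared along these lines — a sketch assuming plain `setWeight` steps with pauses, not the actual spec, and with a hypothetical service name.)

```yaml
# Sketch of a three-step canary (1% -> 10% -> 65%); pause durations and any
# analysis steps in the real Rollout spec may differ.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service          # hypothetical; the real name is redacted
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {}          # wait for promotion
        - setWeight: 10
        - pause: {}
        - setWeight: 65
        - pause: {}
```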
At this point the stable ReplicaSet scaled up to 42 replicas, the maximum allowed by the HPA, while the canary scaled up to 28 (65% of 42, rounded up; I assume Argo Rollouts derives the canary count from the traffic percentage applied to either the current or the total replica count).
Inspecting the HPA, we can see that both CPU and memory (it scales on both) are below their targets. It stayed like that for hours, throughout the whole canary, and never scaled down:
```
Reference:     Rollout/<redacted>
Metrics:       ( current / target )
  resource cpu of container "<redacted>" on pods (as a percentage of request):     4% (41m) / 100%
  resource memory of container "<redacted>" on pods (as a percentage of request):  42% (1834665691428m) / 60%
Min replicas:  10
Max replicas:  42
Rollout pods:  42 current / 42 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from memory container resource utilization (percentage of request)
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
Events:          <none>
```
This is how it looks in Grafana: at about 13:30, when the traffic shift went from 10% to 65%, we saw the huge spike in replicas, both stable and canary, going from a total of 22 to 70.
ArgoRollouts v1.7.1
ArgoCD v2.9.1+58b04e5
Kubernetes v1.29.4