During a Canary deployment, the Stable ReplicaSet temporarily drops to zero and then recovers, causing brief downtime.
y0ngha opened this issue · 0 comments
Checklist:
- I've included steps to reproduce the bug.
- I've included the version of argo rollouts.
Describe the bug
When syncing the Rollout object during a Canary deployment, the Stable ReplicaSet is scaled down to 0.
To Reproduce
- Set the AutoSync option in ArgoCD to False.
- Set up the yaml file that ArgoCD subscribes to:
  - Set spec.replicas to 2 in the Rollout object.
  - Create an HPA for the service and set its minReplicas to 3.
  - In the ArgoCD Application manifest, set RespectIgnoreDifferences to False and do not define any ignoreDifferences.
- Change a specific setting in the yaml file from step 2 and push the change. (At this point, spec.replicas in the yaml file is still 2.)
- Press Sync in ArgoCD (prune: false, replace: false, force: false).
- The problem occurs: the Stable ReplicaSet is scaled down to 0.
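A minimal sketch of the conflicting manifests described above. All names, the image, and the maxReplicas bound are illustrative placeholders, not values from the actual cluster; the point is that Git pins spec.replicas to 2 while the HPA has already scaled the Rollout to 3.

```yaml
# Rollout as stored in Git: replicas pinned to 2 (illustrative names)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: point-service            # placeholder
spec:
  replicas: 2                    # value committed to Git
  strategy:
    canary: {}                   # canary steps omitted for brevity
  selector:
    matchLabels:
      app: point-service
  template:
    metadata:
      labels:
        app: point-service
    spec:
      containers:
      - name: point-service
        image: example/point-service:latest   # placeholder
---
# HPA targeting the Rollout; its minReplicas of 3 conflicts with Git's 2
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: point-service            # placeholder
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: point-service
  minReplicas: 3
  maxReplicas: 5                 # illustrative upper bound
```

On Sync, ArgoCD overwrites the live replica count with the Git value, which the Rollout controller then reconciles against the HPA-driven count.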
Expected behavior
I expected the original Stable ReplicaSet to remain unaffected (or to revert back to 2 as defined in the original yaml) rather than be scaled to 0.
Version
quay.io/argoproj/argo-rollouts:v1.6.5
Logs
time="2024-04-26T07:39:20Z" level=info msg="Started syncing rollout" generation=655 namespace=passorder resourceVersion=151340900 rollout=point-service-production
time="2024-04-26T07:39:20Z" level=error msg="The Rollout \"point-service-production\" is invalid: spec.strategy.canary.trafficRouting.istio.virtualServices.name: Invalid value: \"point-service-production-virtualservice\": Istio VirtualService has invalid HTTP routes. Error: HTTP Route 'weighted' is not found in the defined Virtual Service." namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Reconciliation completed" generation=655 namespace=passorder resourceVersion=151340900 rollout=point-service-production time_ms=2.29211
time="2024-04-26T07:39:20Z" level=info msg="Started syncing rollout" generation=655 namespace=passorder resourceVersion=151340900 rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Syncing replicas only due to scaling event" namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Enqueueing parent of passorder/point-service-production-7b74795c68: Rollout passorder/point-service-production"
time="2024-04-26T07:39:20Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"passorder\", Name:\"point-service-production\", UID:\"6e46b387-6798-4cdd-aa0a-3c3efeb78306\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"151340900\", FieldPath:\"\"}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down ReplicaSet point-service-production-7b74795c68 (revision 11) from 3 to 0"
time="2024-04-26T07:39:20Z" level=info msg="Scaled down ReplicaSet point-service-production-7b74795c68 (revision 11) from 3 to 0" event_reason=ScalingReplicaSet namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Not finished reconciling stableRS" namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="No status changes. Skipping patch" generation=655 namespace=passorder resourceVersion=151340900 rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Reconciliation completed" generation=655 namespace=passorder resourceVersion=151340900 rollout=point-service-production time_ms=19.385227
time="2024-04-26T07:39:20Z" level=info msg="Started syncing rollout" generation=655 namespace=passorder resourceVersion=151340900 rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Canary steps change detected (new: 84575ff995, old: 6c94bfbdd6)" namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Assuming 6dd487bcb9 for new replicaset pod hash" namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Canary steps change detected (new: 84575ff995, old: 6c94bfbdd6)" namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Rollout not completed, started update to revision 12 (6dd487bcb9)" event_reason=RolloutNotCompleted namespace=passorder rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"passorder\", Name:\"point-service-production\", UID:\"6e46b387-6798-4cdd-aa0a-3c3efeb78306\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"151340900\", FieldPath:\"\"}): type: 'Normal' reason: 'RolloutNotCompleted' Rollout not completed, started update to revision 12 (6dd487bcb9)"
time="2024-04-26T07:39:20Z" level=info msg="Patched: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-04-02T12:02:32Z\",\"lastUpdateTime\":\"2024-04-02T12:02:32Z\",\"message\":\"Rollout is paused\",\"reason\":\"RolloutPaused\",\"status\":\"False\",\"type\":\"Paused\"},{\"lastTransitionTime\":\"2024-04-26T05:45:47Z\",\"lastUpdateTime\":\"2024-04-26T05:45:47Z\",\"message\":\"Rollout has minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"True\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2024-04-26T07:39:20Z\",\"lastUpdateTime\":\"2024-04-26T07:39:20Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-04-02T12:02:32Z\",\"lastUpdateTime\":\"2024-04-26T07:39:20Z\",\"message\":\"Rollout \\\"point-service-production\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2024-04-26T07:39:20Z\",\"lastUpdateTime\":\"2024-04-26T07:39:20Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"}],\"currentPodHash\":\"6dd487bcb9\",\"currentStepHash\":\"84575ff995\",\"currentStepIndex\":0,\"message\":\"more replicas need to be updated\",\"phase\":\"Progressing\",\"updatedReplicas\":null}}" generation=655 namespace=passorder resourceVersion=151340900 rollout=point-service-production
time="2024-04-26T07:39:20Z" level=info msg="persisted to informer" generation=655 namespace=passorder resourceVersion=151340905 rollout=point-service-production
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
The current solution
When performing a Sync of the ArgoCD Application, I set RespectIgnoreDifferences to True and defined the following ignoreDifferences:
ignoreDifferences:
- group: apps
  kind: Deployment
  jsonPointers:
  - /spec/replicas
- group: argoproj.io
  kind: Rollout
  jsonPointers:
  - /spec/replicas
- group: autoscaling
  kind: HorizontalPodAutoscaler
  jsonPointers:
  - /spec/minReplicas
  - /spec/maxReplicas
With this in place, the Sync no longer scales down the Stable ReplicaSet, which eliminates the service downtime.
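For completeness, a sketch of where both settings live in the ArgoCD Application manifest. The application name, repo URL, path, and destination are placeholders; the two relevant pieces are the RespectIgnoreDifferences=true sync option and the ignoreDifferences list above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: point-service            # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/repo.git   # placeholder
    path: manifests                          # placeholder
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: passorder
  syncPolicy:
    syncOptions:
    - RespectIgnoreDifferences=true   # make Sync (not just diffing) honor ignoreDifferences
  ignoreDifferences:
  - group: argoproj.io
    kind: Rollout
    jsonPointers:
    - /spec/replicas
  - group: autoscaling
    kind: HorizontalPodAutoscaler
    jsonPointers:
    - /spec/minReplicas
    - /spec/maxReplicas
```

Without RespectIgnoreDifferences=true, ignoreDifferences only suppresses the diff in the UI; the Sync itself still pushes the Git value of spec.replicas onto the live object.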
Ref
Github Issue: #3543
Slack: https://cloud-native.slack.com/archives/C01U781DW2E/p1714353130134499