Unexpected periodic istiod scale-down
vitalii-buchyn-exa opened this issue · 15 comments
Describe the bug
Hello community,
We are experiencing a periodic, unexpected istiod deployment scale-down that goes far below the HPA recommendation and below the minimum replica count.
It happens every 12 hours in our case.
This behaviour increases control-plane traffic and therefore increases our outbound traffic costs.
If we stop the operator (scale its deployment down to 0 replicas), we don't see this behaviour.
Example of the replicas spec:

```yaml
replicas:
  count: 30
  max: 160
  min: 30
```
Please also find an example screenshot attached.
Please let me know if any additional info is required.
Thank you!
Expected behavior
No such drastic scale-down; a rollout restart could be used instead.
Additional context
Platform: 1.23.16-gke.1400
Operator version: v2.16.1
Pilot version: 1.15.3
Hi @vitalii-buchyn-exa. Is there something interesting in the operator's logs during these events?
Hi @LuciferInLove, nothing I can spot at least. Here are a couple of minutes of logs before and after the issue:
some-istio-operator.log
For the istiod deployment we see events like:
```
Events:
Type    Reason             Age                    From                   Message
----    ------             ----                   ----                   -------
Normal  ScalingReplicaSet  69s (x7 over 4d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 54
Normal  ScalingReplicaSet  69s                    deployment-controller  Scaled up replica set istiod-istio2-65745fd459 to 71
Normal  ScalingReplicaSet  53s (x6 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 51
Normal  ScalingReplicaSet  53s (x5 over 2d12h)    deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 53
Normal  ScalingReplicaSet  52s (x7 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 50
Normal  ScalingReplicaSet  52s (x3 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 46
Normal  ScalingReplicaSet  52s (x4 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 45
Normal  ScalingReplicaSet  52s (x3 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 44
Normal  ScalingReplicaSet  52s                    deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 41
Normal  ScalingReplicaSet  51s (x123 over 4d12h)  deployment-controller  (combined from similar events): Scaled down replica set istiod-istio2-65745fd459 to 0
```
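As a quick sanity check, the lowest replica count the controller scaled a ReplicaSet to can be pulled straight out of an events excerpt like the one above; a minimal shell sketch (the sample lines are abridged copies of the events in this issue):

```shell
# Find the lowest "Scaled down ... to N" target in a pasted events excerpt
# (abridged sample lines from the events above).
events=$(cat <<'EOF'
Normal ScalingReplicaSet 69s deployment-controller Scaled down replica set istiod-istio2-8696758cfb to 54
Normal ScalingReplicaSet 52s deployment-controller Scaled down replica set istiod-istio2-8696758cfb to 41
Normal ScalingReplicaSet 51s deployment-controller Scaled down replica set istiod-istio2-65745fd459 to 0
EOF
)
# The target replica count is the last field of each "Scaled down" event.
printf '%s\n' "$events" | awk '/Scaled down/ {print $NF}' | sort -n | head -n 1
# → 0 (one ReplicaSet was eventually scaled all the way to zero)
```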
Thanks for the information. I'm trying to reproduce it but with no success yet. Did you notice any changes in resources? For example, was the HPA resource changed maybe?
HPA config doesn't seem to change: min and max values stay the same.
The HPA may be confused for a while with `unknown` in TARGETS, but from what we can observe, once the HPA can read its target metrics again, it tries to set replicas back to at least the minimum value (desired replicas is the yellow line on the screenshot above).
We tried scaling the operator down just before the issue and scaling it back up a while later; the result was a postponed istiod deployment scale-down, for example:
We also tried scaling the operator down/up at random times, but the scale-down doesn't happen at random times. Something seems to happen every 12 hours (the screenshot timezone is GMT+3).
If possible, please try running the operator with the `-devel-mode` and `-verbose` flags for verbose logging.
Sure, `--devel-mode` and `--verbose` added.
Will wait till the next downscale event and share logs.
Here is the log from 5 minutes before to 5 minutes after the issue (the log timezone is GMT):
some-debug-istio-operator.log
The most interesting part:

```
2023-04-17T23:55:20.746Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ui-ingressgateway"}
```
Is the gateway address changed every 12 hours?
I can see a lot of such events in the log:
```shell
% grep "controllers.IstioMeshGateway gateway address has changed" istio-operator-67db7ccb48-9z2pp.log | wc -l
237
```
The events look like:
```
2023-04-18T08:51:40.263Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:08:14.282Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:08:20.065Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:08:25.864Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:24:59.887Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:25:05.662Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:25:11.464Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:41:45.482Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:41:51.252Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:41:57.051Z INFO controllers.IstioMeshGateway gateway address has changed, trigger reconciler by requeuing {"istiomeshgateway": "istio-system/ui-ingressgateway"}
```
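As a side note, the spacing between the requeue events for a single gateway can be checked with a small shell sketch (GNU `date` assumed; timestamps below are the `ui-ingressgateway` lines from the log above with fractional seconds truncated) — they come roughly every 16–17 minutes, so they don't explain the 12-hour period on their own:

```shell
# Intervals between consecutive "gateway address has changed" events for
# ui-ingressgateway (timestamps truncated from the log above).
ts="2023-04-18T08:51:40Z
2023-04-18T09:08:25Z
2023-04-18T09:25:11Z
2023-04-18T09:41:57Z"
prev=""
printf '%s\n' "$ts" | while read -r t; do
  s=$(date -u -d "$t" +%s)   # GNU date; on macOS use gdate
  if [ -n "$prev" ]; then
    echo "$(( (s - prev) / 60 )) min"
  fi
  prev=$s
done
# → prints "16 min" three times
```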
Shouldn't there be some connection between reconciling the istiod deployment and the istiomeshgateway deployments?
An `istiomeshgateway` IP change shouldn't cause the `istiod` deployment to scale down. I checked that on the same cluster. Only IstioControlPlane changes can cause this behavior. Could something be changing every 12 hours on your cluster that causes these resource changes? Could you please also share the IstioControlPlane that you use?
> Could something be changing every 12 hours on your cluster that causes these resource changes?

I wish I knew... I can only find `logWindowDuration` and something JWT-token related.
Here is an example of IstioControlPlane CR:
istiocontrolplane.yaml.zip
These settings can't affect such behavior. Otherwise, it would be reproducible everywhere, including our test envs.
Unfortunately, there are no clues in the IstioControlPlane. I tried this ICP with some modifications according to my env, and there were no downscales for 16 hours.
Another interesting line in the logs:
```
2023-04-18T00:00:20.039Z DEBUG controllers.IstioControlPlane.discovery resource diff {"name": "istiod-istio2", "namespace": "istio-system", "apiVersion": "apps/v1", "kind": "Deployment", "patch": "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"exa_service\":\"ecp-inf-istio\"}}}}}"}
```
Could something be causing these periodic label changes?
@LuciferInLove you are a genius, Sir! There is indeed a label-injector CronJob with the schedule `0 */12 * * *`, and that job modifies the istiod labels...
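For reference, `0 */12 * * *` fires at minute 0 of every hour divisible by 12, i.e. 00:00 and 12:00, which lines up with the observed 12-hour cadence; a tiny sketch:

```shell
# Hours matched by the cron schedule "0 */12 * * *"
# (minute 0 of every 12th hour of the day)
for h in $(seq 0 23); do
  if [ $((h % 12)) -eq 0 ]; then
    printf '%02d:00\n' "$h"
  fi
done
# → 00:00
# → 12:00
```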
Just wondering: could such a change trigger a rollout restart instead of a full scale-down?
Thank you for your time and help!
@vitalii-buchyn-exa no problem. Feel free to reach out.
We'll take a look at what we can do. You can try changing the PodDisruptionBudget settings for now. Maybe the tips from this issue will come in handy as well:
kubernetes/kubernetes#72231
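Along those lines, a minimal sketch of what such a PodDisruptionBudget could look like for the istiod pods (the `app: istiod` selector, name, namespace, and the 80% threshold are assumptions for illustration — check the labels your deployment actually carries):

```yaml
# Hypothetical sketch: limit how many istiod pods may be unavailable at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istiod-istio2          # assumed name, matching the deployment above
  namespace: istio-system
spec:
  minAvailable: 80%            # assumed threshold; tune to your replica count
  selector:
    matchLabels:
      app: istiod              # assumed label; verify against your pods
```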