banzaicloud/istio-operator

Unexpected periodic istiod scale down

vitalii-buchyn-exa opened this issue · 15 comments

Describe the bug
Hello community,

We are experiencing periodic, unexpected scale-downs of the istiod deployment to a replica count far below the HPA's recommendation and below the configured minimum replica count.
It happens every 12 hours in our case.

This behaviour leads to an increase in control plane traffic and, consequently, an increase in outbound traffic costs for us.

If we stop the operator (scale its deployment down to 0 replicas), we don't see this behaviour.

Example of the replicas spec:

replicas:
  count: 30
  max: 160
  min: 30

Please also find an example screenshot attached.

Please let me know if any additional info is required.

Thank you!

Expected behavior
Not to have such a drastic scale-down. A rollout restart could be used instead.
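For reference, the kind of restart we would prefer over a scale-down is something like the following (deployment and namespace names taken from the events further below):

kubectl -n istio-system rollout restart deployment/istiod-istio2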

Screenshots
image

Additional context
Platform 1.23.16-gke.1400
Operator version v2.16.1
Pilot version 1.15.3

Hi @vitalii-buchyn-exa. Is there something interesting in the operator's logs during these events?

@LuciferInLove Hi, nothing I can spot, at least. Here are a couple of minutes of logs from before and after the issue:
some-istio-operator.log

For the istiod deployment we see events like:

Events:
  Type    Reason             Age                    From                   Message
  ----    ------             ----                   ----                   -------
  Normal  ScalingReplicaSet  69s (x7 over 4d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 54
  Normal  ScalingReplicaSet  69s                    deployment-controller  Scaled up replica set istiod-istio2-65745fd459 to 71
  Normal  ScalingReplicaSet  53s (x6 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 51
  Normal  ScalingReplicaSet  53s (x5 over 2d12h)    deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 53
  Normal  ScalingReplicaSet  52s (x7 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 50
  Normal  ScalingReplicaSet  52s (x3 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 46
  Normal  ScalingReplicaSet  52s (x4 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 45
  Normal  ScalingReplicaSet  52s (x3 over 3d)       deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 44
  Normal  ScalingReplicaSet  52s                    deployment-controller  Scaled down replica set istiod-istio2-8696758cfb to 41
  Normal  ScalingReplicaSet  51s (x123 over 4d12h)  deployment-controller  (combined from similar events): Scaled down replica set istiod-istio2-65745fd459 to 0

Thanks for the information. I'm trying to reproduce it, but with no success yet. Did you notice any changes in resources? For example, was the HPA resource changed?

HPA config doesn't seem to change: min and max values stay the same.

The HPA may get confused for a while with unknown in TARGETS, but from what we can observe, once the HPA can read the target metrics again, it tries to set the replicas to at least the minimum value (desired replicas is the yellow line on the screenshot above).
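To confirm when the HPA loses its metrics, we inspect its status conditions, roughly like this (the HPA name here is an assumption, matching the Deployment name):

kubectl -n istio-system describe hpa istiod-istio2
kubectl -n istio-system get hpa istiod-istio2 -o jsonpath='{.status.conditions}'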

We tried scaling the operator down just before the issue and scaling it back up after a while; the result was a postponed istiod deployment scale-down. Example:
image

We also tried scaling the operator down/up at random times, but the issue doesn't occur at random times. It seems something happens every 12 hours (the timezone on the screenshot is GMT+3).

If possible, please try running the operator with the -devel-mode and -verbose flags for verbose logging.

Sure, --devel-mode and --verbose added.
Will wait until the next downscale event and share the logs.
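(For reference, we just appended the flags to the operator container args; a minimal sketch, container name assumed:)

containers:
  - name: istio-operator
    args:
      - --devel-mode
      - --verbose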

Here is the log from 5 minutes before to 5 minutes after the issue (the timezone of the logs is GMT):
some-debug-istio-operator.log

The scale-down happened at 03:00:30 GMT+3.
image

The most interesting part:

2023-04-17T23:55:20.746Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}

Does the gateway address change every 12 hours?

I can see a lot of such events in the log:

% grep "controllers.IstioMeshGateway      gateway address has changed" istio-operator-67db7ccb48-9z2pp.log | wc -l
     237

The events look like:

2023-04-18T08:51:40.263Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:08:14.282Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:08:20.065Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:08:25.864Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:24:59.887Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:25:05.662Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:25:11.464Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}
2023-04-18T09:41:45.482Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-apigee"}
2023-04-18T09:41:51.252Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ingress-api"}
2023-04-18T09:41:57.051Z	INFO	controllers.IstioMeshGateway	gateway address has changed, trigger reconciler by requeuing	{"istiomeshgateway": "istio-system/ui-ingressgateway"}

Shouldn't there be some connection between reconciling the istiod deployment and reconciling the istiomeshgateway deployments?

An istiomeshgateway IP change shouldn't cause the istiod deployment to scale down; I checked that on the same cluster. Only IstioControlPlane changes can cause this behavior. Could something on your cluster be changing every 12 hours and causing these resource changes? Could you please also share the IstioControlPlane that you use?

Could something be changed every 12 hours on your cluster that causes these resource changes?
I wish I knew... I can only find logWindowDuration

and something JWT token related.

Here is an example of the IstioControlPlane CR we use:
istiocontrolplane.yaml.zip

These settings can't affect such behavior. Otherwise, it would be reproducible everywhere, including our test envs.

Unfortunately, there are no clues in the IstioControlPlane. I tried this ICP with some modifications according to my env, and there were no downscales for 16 hours.

Another interesting line in the logs:

2023-04-18T00:00:20.039Z	DEBUG	controllers.IstioControlPlane.discovery	resource diff
{"name": "istiod-istio2", "namespace": "istio-system", "apiVersion": "apps/v1", "kind": "Deployment",
"patch": "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"exa_service\":\"ecp-inf-istio\"}}}}}"}

Could something be causing these periodic label changes?

@LuciferInLove you are a genius, Sir! There is indeed a label-injector cronjob with schedule 0 */12 * * * that modifies the istiod labels...
Just wondering: could such a change trigger a rollout restart instead of a full scale-down?
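In effect, every 12 hours the job applies a change equivalent to the patch the operator reported in the diff above, roughly (illustration only, our job works differently under the hood):

kubectl -n istio-system patch deployment istiod-istio2 --type=merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"exa_service":"ecp-inf-istio"}}}}}'

and the operator then reconciles the Deployment back to the IstioControlPlane spec, which is presumably what also resets the replica count.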

Thank you for your time and help!

@vitalii-buchyn-exa no problem. Feel free to reach out.
We'll take a look at what can be done. You can try changing the PodDisruptionBudget settings for now. Maybe the tips from here will come in handy as well:
kubernetes/kubernetes#72231
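For example, a PodDisruptionBudget along these lines (a minimal sketch; the selector labels and the minAvailable threshold here are assumptions, adjust them to your istiod pods):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istiod-istio2
  namespace: istio-system
spec:
  minAvailable: "80%"
  selector:
    matchLabels:
      app: istiod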