Deployments scaled down very quickly despite low maxDisruption
michael-careplanner opened this issue · 20 comments
We have several deployments scaled by WPA on fairly spiky queues. In order to limit how spiky our scale downs are we set maxDisruption to a low value.
However, we still see scale downs happening very quickly. Looking at the logs, it seems WPA is making scaling decisions several times a second. For example, in one instance, despite having maxDisruption set to 1%, we saw a deployment scale from 53 replicas to 30 replicas in 5 seconds. Each time the deployment was decreased by 1 replica (so obeying the maxDisruption value set), but WPA made 26 individual scaling decisions in those 5 seconds.
I think a (possibly configurable) cooldown needs to be added after each scaling decision, so that setting a low maxDisruption can actually prevent fast scale downs.
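For illustration, here is a rough sketch of the arithmetic involved. The formula for maxDisruptable is inferred from the values that appear in the logs later in this thread (the ceiling of maxDisruption% of the current replica count) and may not match the WPA code exactly; the point is that maxDisruption only caps the size of each individual scale-down decision, not how often decisions are made.

```go
package main

import (
	"fmt"
	"math"
)

// maxDisruptable caps how many replicas a single scale-down decision may
// remove: the ceiling of maxDisruption% of the current replica count.
// (Inferred from log values later in this thread, e.g. current=115,
// maxDisruption=1% -> maxDisruptable=2; not necessarily the exact WPA code.)
func maxDisruptable(currentReplicas int32, maxDisruptionPct float64) int32 {
	return int32(math.Ceil(float64(currentReplicas) * maxDisruptionPct / 100))
}

func main() {
	fmt.Println(maxDisruptable(53, 1))  // 1 replica per decision
	fmt.Println(maxDisruptable(115, 1)) // 2 replicas per decision
	// With decisions made several times a second, even 1 replica per
	// decision drains 53 -> 30 in a few seconds, hence the cooldown idea.
}
```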
Specifying secondsToProcessOneJob will help here. If secondsToProcessOneJob is specified, WPA will keep a minimum number of workers based on the RPM (messages sent per minute) of the queue. Basically this is the logic (code):
workersBasedOnMessagesSent := int32(math.Ceil((secondsToProcessOneJob * messagesSentPerMinute) / 60))
if workersBasedOnMessagesSent > minWorkers {
    return workersBasedOnMessagesSent
}
return minWorkers
- messagesSentPerMinute is the average number of messages sent to the SQS queue in one minute (code).
- minWorkers is the minimum replicas you have specified in the WPA spec (kubectl edit wpa).
- secondsToProcessOneJob is the value you specify in the spec.
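To make the effect concrete, here is a small worked example of the logic above with assumed numbers (secondsToProcessOneJob=2, 900 messages sent per minute, minReplicas=5), not values from any real cluster:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumed example values.
	secondsToProcessOneJob := 2.0
	messagesSentPerMinute := 900.0
	minWorkers := int32(5)

	// Workers needed to keep up with the current send rate.
	workersBasedOnMessagesSent := int32(math.Ceil((secondsToProcessOneJob * messagesSentPerMinute) / 60))

	if workersBasedOnMessagesSent > minWorkers {
		fmt.Println(workersBasedOnMessagesSent) // 30: the RPM-based floor wins over minReplicas
	} else {
		fmt.Println(minWorkers)
	}
}
```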
Can you try this and let us know if this helps?
@michael-careplanner I think the resync-period configuration is set to a very low value (like 200ms), which would explain the rapid scaling decisions. The default value is 20 seconds, which works well for us. Please confirm if this is the case, and if so, is there any particular reason for setting it to a value lower than 20 seconds? Scaling up from 0 is instantaneous anyway due to the long-polling logic, and I can't figure out another use case for setting a small value for resync-period.
@justjkk Looking at our config, resync-period is actually set to 40 seconds, up from the default of 20. Agreed that having it set to a very low value doesn't make much sense, but that isn't the case in our config.
@alok87 Thanks for your reply. I am planning to look at secondsToProcessOneJob at some point, but I feel that's a separate issue from this one. I still don't think WPA should be making scale down decisions so close together, especially given the resync period we've set.
Is it possible to share the WPA log for a particular queue with -v=4 verbosity? It will help in debugging.
Not sure if this is a large enough snippet of logs but it does show the rapid scale down.
Resync-period is not being respected. I see the control loop executing every 200ms instead of every 20 seconds.
I0204 08:25:22.941872 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 31
I0204 08:25:22.968042 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 30
I0204 08:25:23.148417 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 29
I0204 08:25:23.350866 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 28
I think the event handlers need to use AddEventHandlerWithResyncPeriod; WPA is currently using AddEventHandler.
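For context, a minimal client-go sketch of the difference (the informer argument and enqueueWPA function are placeholders, not the actual WPA code):

```go
package controller

import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// registerWPAHandler is a sketch: informer and enqueueWPA stand in for the
// real WPA informer and work-queue enqueue function.
func registerWPAHandler(informer cache.SharedIndexInformer, resyncPeriod time.Duration, enqueueWPA func(obj interface{})) {
	handler := cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueueWPA,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueueWPA(newObj) },
		DeleteFunc: enqueueWPA,
	}

	// informer.AddEventHandler(handler) would fall back to the informer's
	// default resync; passing resyncPeriod explicitly ties this handler to
	// the value given via --resync-period.
	informer.AddEventHandlerWithResyncPeriod(handler, resyncPeriod)
}
```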
Made a change. Can you try this image: practodev/workerpodautoscaler:v1.5.0-5-gfef9160 and send me the logs again if possible? I wanted to test if this fixes the problem!
Tested using that and I am seeing some 40-second pauses (in line with our resync-period). However, I'm also seeing scale downs every 200ms for certain deployments:
I0204 11:18:58.042487 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 61
I0204 11:18:58.243314 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 60
I0204 11:18:58.442599 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 59
I0204 11:18:58.642624 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 58
I0204 11:18:58.842958 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 57
<many more lines of this>
It seems that once it has decided to do a scale down, it keeps looping in the controller without respecting the resync-period until the deployment is at the size WPA wants it to be.
OK, I think that since we are updating the WPA status in every reconciliation loop, a new event is created every time, which bypasses the resync period. Updating the WPA status only when something has actually changed, and treating everything else as a no-op, can prevent this.
@michael-careplanner can you confirm whether the 200ms loops are happening for the queues/WPA objects that hit the else branch here:
if workerPodAutoScaler.Status.CurrentReplicas == currentWorkers &&
    workerPodAutoScaler.Status.AvailableReplicas == availableWorkers &&
    workerPodAutoScaler.Status.DesiredReplicas == desiredWorkers &&
    workerPodAutoScaler.Status.CurrentMessages == queueMessages {
    klog.V(4).Infof("%s/%s: WPA status is already up to date\n", namespace, name)
    return
} else {
    klog.V(4).Infof("%s/%s: Updating wpa status\n", namespace, name)
}
Yes, I see lots of messages like this while it's doing the 200ms reconciliation loop:
controller.go:770] live-api/cpapi-transport: Updating wpa status
Followed by a single controller.go:767] live-api/cpapi-transport: WPA status is already up to date at the end of the fast loop. So it looks like it repeatedly executes the if statement above until it hits the true case, and then roughly respects the resync-period before running again:
I0208 08:49:18.857614 1 controller.go:767] live-api/cpapi-transport: WPA status is already up to date
I0208 08:49:53.650846 1 controller.go:626] cpapi_live_transport min=1, max=300, targetBacklog=10
Yes, then it is confirmed: the status update is leading to a very fast re-queue. We can't stop updating the status, since the queue messages and other details are refreshed with every update, and the dashboards are built on top of these status fields.
We may have to consider a scale down cool-off after all: --scale-down-delay-after-scale-up
Hello @michael-careplanner
Added scale-down-delay-after-last-scale-activity.
--scale-down-delay-after-last-scale-activity: scale down delay after the last scale up or down (defaults to 600 seconds, i.e. 10 minutes).
I have written test cases for it, but have not tested it in any Kubernetes cluster yet.
Here is the image: practodev/workerpodautoscaler:v1.5.0-9-gbeb0731
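Roughly, the gate this flag implies looks like the following sketch; the field and function names are assumptions, not the actual implementation. A scale down is only allowed once enough time has passed since the last scale activity.

```go
package main

import (
	"fmt"
	"time"
)

// scaleDownAllowed is a sketch of the cool-off check implied by
// --scale-down-delay-after-last-scale-activity: block scale downs until
// scaleDownDelay has elapsed since the last scale up or down.
// (Names are assumptions, not the actual WPA code.)
func scaleDownAllowed(lastScaleActivity *time.Time, scaleDownDelay time.Duration) bool {
	if lastScaleActivity == nil {
		return true // no scale activity recorded yet
	}
	return time.Since(*lastScaleActivity) >= scaleDownDelay
}

func main() {
	delay := 600 * time.Second

	recent := time.Now().Add(-2 * time.Minute)
	fmt.Println(scaleDownAllowed(&recent, delay)) // false: still cooling off

	earlier := time.Now().Add(-15 * time.Minute)
	fmt.Println(scaleDownAllowed(&earlier, delay)) // true: cool-off has passed
}
```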
Just tested this and I'm hitting an issue where it returns scale-up when scaling down, and therefore doesn't respect the cooldown:
I0215 11:25:32.221965 1 controller.go:458] cpapi_live_transport current: 96
I0215 11:25:32.221969 1 controller.go:459] cpapi_live_transport qMsgs: 98, desired: 95
I0215 11:25:32.240002 1 controller.go:519] cpapi_live_transport scaleOp: scale-up
Looks like GetScaleOperation expects desiredWorkers then currentWorkers, but the call site passes currentWorkers then desiredWorkers, so the GetScaleOperation logic is inverted!
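To make the inversion concrete, here is a simplified, hypothetical version of the comparison (the real GetScaleOperation takes more parameters than this): swapping the two arguments flips the result.

```go
package main

import "fmt"

// getScaleOperation is a simplified stand-in for WPA's GetScaleOperation:
// the direction is decided by comparing the first argument (desired)
// against the second (current), so passing them in the wrong order
// reports a scale down as a scale up.
func getScaleOperation(desiredWorkers, currentWorkers int32) string {
	switch {
	case desiredWorkers > currentWorkers:
		return "scale-up"
	case desiredWorkers < currentWorkers:
		return "scale-down"
	default:
		return "no-op"
	}
}

func main() {
	current, desired := int32(96), int32(95)
	fmt.Println(getScaleOperation(current, desired)) // swapped args: "scale-up"
	fmt.Println(getScaleOperation(desired, current)) // correct order: "scale-down"
}
```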
@michael-careplanner Fixed it here.
New image with the above change:
practodev/workerpodautoscaler:v1.5.0-11-g7d12579
Tested that version: it now correctly infers the scale type, but it's still not respecting the cooldown. Log:
I0216 09:48:55.389034 1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.389049 1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.389057 1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.389061 1 controller.go:687] cpapi_live_transport current=115, idle=-1
I0216 09:48:55.389066 1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.389073 1 controller.go:458] cpapi_live_transport current: 115
I0216 09:48:55.389079 1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 113
I0216 09:48:55.389091 1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.394090 1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.394109 1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.587844 1 controller.go:827] live-api/cpapi-transport: Updated wpa status
I0216 09:48:55.587944 1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.587958 1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.587965 1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.587977 1 controller.go:687] cpapi_live_transport current=113, idle=-1
I0216 09:48:55.587982 1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.587988 1 controller.go:458] cpapi_live_transport current: 113
I0216 09:48:55.587994 1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 111
I0216 09:48:55.588005 1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.597267 1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.597284 1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.788992 1 controller.go:827] live-api/cpapi-transport: Updated wpa status
I've had a good look over your PR but can't see where the bug is...
I am guessing the status update for lastScaleTime is not being reflected in your WPA object.
status:
  LastScaleTime:
This should get updated after every scale activity. It should start as nil.
I am testing this in one of our clusters.
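As a sketch of that expectation (field and helper names are assumed, not the actual WPA types): LastScaleTime starts as nil and gets stamped after every scale up or down, which is what the cool-off check reads.

```go
package status

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// wpaStatus is a stand-in for the relevant part of the WPA status.
type wpaStatus struct {
	LastScaleTime *metav1.Time // nil until the first scale activity
}

// recordScaleActivity stamps LastScaleTime after a scale up or down, so the
// scale-down cool-off has something to measure against.
func recordScaleActivity(s *wpaStatus) {
	now := metav1.NewTime(time.Now())
	s.LastScaleTime = &now
}
```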
@michael-careplanner Hey! The issue was that scaleDownDelay was being set to 0 due to a bug. I fixed it in f1be3a3.
Thanks for bringing up this issue! I saw the same problem in one of our high-throughput queues. Here is the dashboard image after this change; the scale down becomes smooth (this worker did not set secondsToProcessOneJob).
If someone still needs the old behaviour, they would need to set scaleDownDelay to a lower value than the default of 10 minutes.
The image below has the fix; let me know how it works out for you:
practodev/workerpodautoscaler:v1.5.0-12-gf1be3a3
@michael-careplanner This is good to release from our end. Working perfectly in our latest cluster. Did you get the chance to test it?
@alok87 Yeah, this is working well on our end now. Thanks very much for all your work on this!