Deployments scaled down very quickly despite low maxDisruption
michael-careplanner opened this issue · 20 comments
We have several deployments scaled by WPA on fairly spiky queues. In order to limit how spiky our scale downs are we set maxDisruption to a low value.
However, we still see scale downs happening very quickly. Looking at the logs, it seems WPA is making scaling decisions several times a second. For example, in one instance, despite having maxDisruption set to 1%, we saw a deployment scale from 53 replicas to 30 replicas in 5 seconds. Each time the deployment was decreased by 1 replica (so obeying the maxDisruption value set), but WPA made 26 individual scaling decisions in those 5 seconds.
I think a (possibly configurable) cooldown needs to be added after each scaling decision, so that setting a low maxDisruption can actually prevent fast scale downs.
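For illustration, here is a rough sketch of the arithmetic involved. The formula for maxDisruptable is inferred from the values that appear in the logs later in this thread (the ceiling of maxDisruption% of the current replica count) and may not match the WPA code exactly; the point is that maxDisruption only caps the size of each individual scale-down decision, not how often decisions are made.

```go
package main

import (
	"fmt"
	"math"
)

// maxDisruptable caps how many replicas a single scale-down decision may
// remove: the ceiling of maxDisruption% of the current replica count.
// (Inferred from log values later in this thread, e.g. current=115,
// maxDisruption=1% -> maxDisruptable=2; not necessarily the exact WPA code.)
func maxDisruptable(currentReplicas int32, maxDisruptionPct float64) int32 {
	return int32(math.Ceil(float64(currentReplicas) * maxDisruptionPct / 100))
}

func main() {
	fmt.Println(maxDisruptable(53, 1))  // 1 replica per decision
	fmt.Println(maxDisruptable(115, 1)) // 2 replicas per decision
	// With decisions made several times a second, even 1 replica per
	// decision drains 53 -> 30 in a few seconds, hence the cooldown idea.
}
```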
Specifying secondsToProcessOneJob will help here. If secondsToProcessOneJob is specified, WPA will keep a minimum number of workers based on the RPM (messages sent per minute) of the queue. Basically this is the logic (code):
workersBasedOnMessagesSent := int32(math.Ceil((secondsToProcessOneJob * messagesSentPerMinute) / 60))
if workersBasedOnMessagesSent > minWorkers {
    return workersBasedOnMessagesSent
}
return minWorkers
- messagesSentPerMinute is the average number of messages sent to the SQS queue in one minute (code).
- minWorkers is the minimum replicas you have specified in the WPA spec (kubectl edit wpa).
- secondsToProcessOneJob is the value you specify in the spec.
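To make the effect concrete, here is a small worked example of the logic above with assumed numbers (secondsToProcessOneJob=2, 900 messages sent per minute, minReplicas=5), not values from any real cluster:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumed example values.
	secondsToProcessOneJob := 2.0
	messagesSentPerMinute := 900.0
	minWorkers := int32(5)

	// Workers needed to keep up with the current send rate.
	workersBasedOnMessagesSent := int32(math.Ceil((secondsToProcessOneJob * messagesSentPerMinute) / 60))

	if workersBasedOnMessagesSent > minWorkers {
		fmt.Println(workersBasedOnMessagesSent) // 30: the RPM-based floor wins over minReplicas
	} else {
		fmt.Println(minWorkers)
	}
}
```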
Can you try this and let us know if this helps?
@michael-careplanner I think the resync-period configuration is set to a very low value (like 200ms), which would explain the rapid scaling decisions. The default value is 20 seconds, which works well for us. Please confirm if this is the case, and if so, is there any particular reason for setting it to a value lower than 20 seconds? Scaling up from 0 is instantaneous anyway due to the long-polling logic, and I can't figure out another use case for setting a small value for resync-period.
@justjkk Looking at our config, resync-period is actually set to 40 seconds, up from the default of 20. Agreed that having it set to a very low value doesn't make much sense, but that isn't the case in our config.
@alok87 Thanks for your reply. I am planning to look at secondsToProcessOneJob at some point, but I feel that's a separate issue from this one. I still don't think WPA should be making scale down decisions so close together, especially given the resync period we've set.
Is it possible to share the WPA log for a particular queue with -v=4 verbosity? It will help in debugging.
Not sure if this is a large enough snippet of logs but it does show the rapid scale down.
Resync-period is not being respected. I see the control loop executing every 200ms instead of every 20 seconds.
I0204 08:25:22.941872 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 31
I0204 08:25:22.968042 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 30
I0204 08:25:23.148417 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 29
I0204 08:25:23.350866 1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 28
I think the event handlers need to use AddEventHandlerWithResyncPeriod; WPA is currently using AddEventHandler.
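For context, a minimal client-go sketch of the difference (the informer argument and enqueueWPA function are placeholders, not the actual WPA code):

```go
package controller

import (
	"time"

	"k8s.io/client-go/tools/cache"
)

// registerWPAHandler is a sketch: informer and enqueueWPA stand in for the
// real WPA informer and work-queue enqueue function.
func registerWPAHandler(informer cache.SharedIndexInformer, resyncPeriod time.Duration, enqueueWPA func(obj interface{})) {
	handler := cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueueWPA,
		UpdateFunc: func(oldObj, newObj interface{}) { enqueueWPA(newObj) },
		DeleteFunc: enqueueWPA,
	}

	// informer.AddEventHandler(handler) would fall back to the informer's
	// default resync; passing resyncPeriod explicitly ties this handler to
	// the value given via --resync-period.
	informer.AddEventHandlerWithResyncPeriod(handler, resyncPeriod)
}
```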
Made a change. Can you try this image: practodev/workerpodautoscaler:v1.5.0-5-gfef9160 and send me the logs again if possible? I wanted to test if this fixes the problem!
Tested using that and I am seeing some 40-second pauses (in line with our resync-period). However, I'm also seeing scale downs every 200ms for certain deployments:
I0204 11:18:58.042487 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 61
I0204 11:18:58.243314 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 60
I0204 11:18:58.442599 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 59
I0204 11:18:58.642624 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 58
I0204 11:18:58.842958 1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 57
<many more lines of this>
It seems that once it has decided to do a scale down, it keeps looping in the controller without respecting the resync-period until the deployment is at the size WPA wants it to be.
OK, I think that since we are updating the WPA status in every reconciliation loop, a new event is created every time, which bypasses the resync period. Updating the WPA status only when something has actually changed, and treating everything else as a no-op, can prevent this.
@michael-careplanner can you confirm whether the 200ms loops are happening for the queues/WPA objects that hit the else branch here:
if workerPodAutoScaler.Status.CurrentReplicas == currentWorkers &&
    workerPodAutoScaler.Status.AvailableReplicas == availableWorkers &&
    workerPodAutoScaler.Status.DesiredReplicas == desiredWorkers &&
    workerPodAutoScaler.Status.CurrentMessages == queueMessages {
    klog.V(4).Infof("%s/%s: WPA status is already up to date\n", namespace, name)
    return
} else {
    klog.V(4).Infof("%s/%s: Updating wpa status\n", namespace, name)
}
Yes, I see lots of messages like this while it's doing the 200ms reconciliation loop:
controller.go:770] live-api/cpapi-transport: Updating wpa status
Followed by a single controller.go:767] live-api/cpapi-transport: WPA status is already up to date at the end of the fast loop. So it looks like it repeatedly executes the if statement above until it hits the true case, and then roughly respects the resync-period before running again:
I0208 08:49:18.857614 1 controller.go:767] live-api/cpapi-transport: WPA status is already up to date
I0208 08:49:53.650846 1 controller.go:626] cpapi_live_transport min=1, max=300, targetBacklog=10
Yes, then it is confirmed: the status update is leading to a very fast re-queue. We can't stop updating the status, since the queue messages and other details are refreshed with every update, and the dashboards are built on top of these status fields.
We may have to consider a scale down cool-off after all: --scale-down-delay-after-scale-up
Hello @michael-careplanner
Added scale-down-delay-after-last-scale-activity.
--scale-down-delay-after-last-scale-activity: scale down delay after the last scale up or down (defaults to 600 seconds, i.e. 10 minutes).
I have written test cases for it, but have not tested it in any Kubernetes cluster yet.
Here is the image: practodev/workerpodautoscaler:v1.5.0-9-gbeb0731
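Roughly, the gate this flag implies looks like the following sketch; the field and function names are assumptions, not the actual implementation. A scale down is only allowed once enough time has passed since the last scale activity.

```go
package main

import (
	"fmt"
	"time"
)

// scaleDownAllowed is a sketch of the cool-off check implied by
// --scale-down-delay-after-last-scale-activity: block scale downs until
// scaleDownDelay has elapsed since the last scale up or down.
// (Names are assumptions, not the actual WPA code.)
func scaleDownAllowed(lastScaleActivity *time.Time, scaleDownDelay time.Duration) bool {
	if lastScaleActivity == nil {
		return true // no scale activity recorded yet
	}
	return time.Since(*lastScaleActivity) >= scaleDownDelay
}

func main() {
	delay := 600 * time.Second

	recent := time.Now().Add(-2 * time.Minute)
	fmt.Println(scaleDownAllowed(&recent, delay)) // false: still cooling off

	earlier := time.Now().Add(-15 * time.Minute)
	fmt.Println(scaleDownAllowed(&earlier, delay)) // true: cool-off has passed
}
```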
Just tested this and I'm hitting an issue where it returns scale-up when scaling down, and therefore doesn't respect the cooldown:
I0215 11:25:32.221965 1 controller.go:458] cpapi_live_transport current: 96
I0215 11:25:32.221969 1 controller.go:459] cpapi_live_transport qMsgs: 98, desired: 95
I0215 11:25:32.240002 1 controller.go:519] cpapi_live_transport scaleOp: scale-up
Looks like GetScaleOperation expects desiredWorkers then currentWorkers, but the call site passes currentWorkers then desiredWorkers, so the GetScaleOperation logic is inverted!
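To make the inversion concrete, here is a simplified, hypothetical version of the comparison (the real GetScaleOperation takes more parameters than this): swapping the two arguments flips the result.

```go
package main

import "fmt"

// getScaleOperation is a simplified stand-in for WPA's GetScaleOperation:
// the direction is decided by comparing the first argument (desired)
// against the second (current), so passing them in the wrong order
// reports a scale down as a scale up.
func getScaleOperation(desiredWorkers, currentWorkers int32) string {
	switch {
	case desiredWorkers > currentWorkers:
		return "scale-up"
	case desiredWorkers < currentWorkers:
		return "scale-down"
	default:
		return "no-op"
	}
}

func main() {
	current, desired := int32(96), int32(95)
	fmt.Println(getScaleOperation(current, desired)) // swapped args: "scale-up"
	fmt.Println(getScaleOperation(desired, current)) // correct order: "scale-down"
}
```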
@michael-careplanner Fixed it here.
New image with the above change:
practodev/workerpodautoscaler:v1.5.0-11-g7d12579
Tested that version: it now correctly infers the scale type, but it's still not respecting the cooldown. Log:
I0216 09:48:55.389034 1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.389049 1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.389057 1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.389061 1 controller.go:687] cpapi_live_transport current=115, idle=-1
I0216 09:48:55.389066 1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.389073 1 controller.go:458] cpapi_live_transport current: 115
I0216 09:48:55.389079 1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 113
I0216 09:48:55.389091 1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.394090 1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.394109 1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.587844 1 controller.go:827] live-api/cpapi-transport: Updated wpa status
I0216 09:48:55.587944 1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.587958 1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.587965 1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.587977 1 controller.go:687] cpapi_live_transport current=113, idle=-1
I0216 09:48:55.587982 1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.587988 1 controller.go:458] cpapi_live_transport current: 113
I0216 09:48:55.587994 1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 111
I0216 09:48:55.588005 1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.597267 1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.597284 1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.788992 1 controller.go:827] live-api/cpapi-transport: Updated wpa status
I've had a good look over your PR but can't see where the bug is...
I am guessing the status update for lastScaleTime is not being reflected in your WPA object.
status:
  LastScaleTime:
This should get updated after every scale activity. It should start as nil.
I am testing this in one of our clusters.
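As a sketch of that expectation (field and helper names are assumed, not the actual WPA types): LastScaleTime starts as nil and gets stamped after every scale up or down, which is what the cool-off check reads.

```go
package status

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// wpaStatus is a stand-in for the relevant part of the WPA status.
type wpaStatus struct {
	LastScaleTime *metav1.Time // nil until the first scale activity
}

// recordScaleActivity stamps LastScaleTime after a scale up or down, so the
// scale-down cool-off has something to measure against.
func recordScaleActivity(s *wpaStatus) {
	now := metav1.NewTime(time.Now())
	s.LastScaleTime = &now
}
```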
@michael-careplanner Hey! The issue was that scaleDownDelay was being set to 0 due to a bug. I fixed it in f1be3a3.
Thanks for bringing up this issue! I saw the same problem in one of our high-throughput queues. Here is the dashboard image after this change; the scale down becomes smooth (this worker did not set secondsToProcessOneJob).
If someone still needs the old behaviour, they would need to set scaleDownDelay to a lower value than the default of 10 minutes.
The image below has the fix; let me know how it works out for you:
practodev/workerpodautoscaler:v1.5.0-12-gf1be3a3
@michael-careplanner This is good to release from our end. Working perfectly in our latest cluster. Did you get the chance to test it?
@alok87 Yeah, this is working well on our end now. Thanks very much for all your work on this!