Scaling with spiky queues
mf-lit opened this issue · 6 comments
I'm looking at using worker-pod-autoscaler as a means to scale our application based on an SQS queue. Overall it's been working well, indeed it behaves exactly as described in the docs.
However, its behaviour doesn't fit the nature of our workloads well, so I'm opening this issue to describe the problem and hopefully start some discussion about a possible solution.
Our queue backlog is "spiky", large numbers of messages arrive on the queue both unpredictably and suddenly. To demonstrate this I created the following WorkerPodAutoscaler spec:
```yaml
apiVersion: k8s.practo.dev/v1
kind: WorkerPodAutoScaler
spec:
  deploymentName: cpapi-transport
  maxDisruption: 100%
  maxReplicas: 100
  minReplicas: 0
  queueURI: https://sqs.eu-west-1.amazonaws.com/xxxx/queue_name
  secondsToProcessOneJob: null
  targetMessagesPerWorker: 40
```
Then I sent 1000 messages to the empty queue (this took about 10 seconds).
A graph of the queue length over the coming minutes looks like this:
And a graph of desired replicas looks like this:
We can see that WPA behaves as expected. Replicas scale immediately to (1000/40 =) 25, and the queue length drops rapidly as a result. Then, as the queue continues to fall, replicas are removed and the message removal rate slows, until some time later the queue is finally back to zero and the replica count goes to zero too.
The problem for us is that the number of workers reduces in proportion to the queue length, which means the removal rate is constantly falling and items therefore remain on the queue for longer than we would like. For us, the length of the queue backlog is irrelevant as an SLO; what matters is how long items sit on the queue.
In the example above it has taken eight minutes for the queue to finally reach zero. For our use-case we do not want messages to sit on the queue for more than five minutes. I could try to mitigate this by reducing targetMessagesPerWorker, and that may be reasonably effective, but it would result in a lot of initial replicas and still suffer from an ever-decreasing removal rate, which is very inefficient. The behaviour would also differ for larger or smaller spikes.
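As a rough illustration of the trade-off, here's a toy simulation of proportional scaling (all numbers are made up: assume each worker takes 5 seconds per message and replicas are recalculated every 30 seconds; real behaviour also depends on CloudWatch delay and pod start-up time):

```go
package main

import (
	"fmt"
	"math"
)

// drainSeconds simulates a backlog consumed by workers that are resized to
// ceil(backlog/target) every step, i.e. purely proportional scaling.
func drainSeconds(backlog, target, msgsPerWorkerPerSec, stepSec float64) float64 {
	elapsed := 0.0
	for backlog > 0 {
		replicas := math.Ceil(backlog / target)
		// The removal rate shrinks together with the backlog.
		backlog -= replicas * msgsPerWorkerPerSec * stepSec
		elapsed += stepSec
	}
	return elapsed
}

func main() {
	for _, target := range []float64{40, 20, 10} {
		fmt.Printf("targetMessagesPerWorker=%v: initial replicas=%v, time to empty=%vs\n",
			target, math.Ceil(1000/target), drainSeconds(1000, target, 0.2, 30))
	}
}
```

Lowering the target does drain the queue faster, but only by throwing a much larger initial spike of replicas at it, and the removal rate still decays as the backlog shrinks.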
My suggestion would be a third alternative metric (in addition to targetMessagesPerWorker and secondsToProcessOneJob) called something like maxDesiredMessageAge.
To create an algorithm based on that metric we also need to know some more information about the queue:
- How old is the oldest message?
- How long until the queue reaches zero based on the current removal rate?
For SQS both of those metrics can be found (or at least derived) from CloudWatch:
- ApproximateAgeOfOldestMessage
- ApproximateNumberOfMessagesVisible / NumberOfMessagesDeleted
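For what it's worth, fetching e.g. ApproximateAgeOfOldestMessage from CloudWatch is straightforward with the AWS SDK for Go; a minimal sketch (the queue name, region and 1-minute period here are only illustrative):

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// ageOfOldestMessage returns the most recent ApproximateAgeOfOldestMessage
// datapoint (in seconds) for the given SQS queue.
func ageOfOldestMessage(cw *cloudwatch.CloudWatch, queueName string) (float64, error) {
	out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("AWS/SQS"),
		MetricName: aws.String("ApproximateAgeOfOldestMessage"),
		Dimensions: []*cloudwatch.Dimension{
			{Name: aws.String("QueueName"), Value: aws.String(queueName)},
		},
		StartTime:  aws.Time(time.Now().Add(-5 * time.Minute)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(60),
		Statistics: []*string{aws.String("Maximum")},
	})
	if err != nil || len(out.Datapoints) == 0 {
		return 0, err
	}
	// Pick the newest datapoint returned.
	latest := out.Datapoints[0]
	for _, dp := range out.Datapoints[1:] {
		if dp.Timestamp.After(*latest.Timestamp) {
			latest = dp
		}
	}
	return *latest.Maximum, nil
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	age, err := ageOfOldestMessage(cloudwatch.New(sess), "queue_name")
	fmt.Println(age, err)
}
```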
The algorithm would then be a loop that looks something like this (pseudo-code):
```
AGE_OF_OLDEST_MESSAGE = ApproximateAgeOfOldestMessage
ESTIMATED_TIME_TO_ZERO = ApproximateNumberOfMessagesVisible / NumberOfMessagesDeleted

if ESTIMATED_TIME_TO_ZERO > AGE_OF_OLDEST_MESSAGE:
    SCALE_METRIC = ESTIMATED_TIME_TO_ZERO
else:
    SCALE_METRIC = AGE_OF_OLDEST_MESSAGE

DESIRED_REPLICAS = ceil(CURRENT_REPLICAS * (SCALE_METRIC / maxDesiredMessageAge))
```
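Or, as a slightly more concrete sketch in Go (maxDesiredMessageAge and both inputs are hypothetical; this is not existing WPA code):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas scales the current replica count so that the queue should be
// drained (and the oldest message consumed) within maxDesiredMessageAge.
// All inputs are in seconds.
func desiredReplicas(current int32, ageOfOldestMessage, estimatedTimeToZero, maxDesiredMessageAge float64) int32 {
	if maxDesiredMessageAge <= 0 {
		return current
	}
	// Scale on whichever is worse: the projected drain time or the age of the
	// oldest message already sitting on the queue.
	scaleMetric := math.Max(estimatedTimeToZero, ageOfOldestMessage)

	// The proportional formula can't scale up from zero, so bootstrap with one.
	base := math.Max(float64(current), 1)

	// Same shape as the HPA calculation: desired = ceil(current * metric / target).
	return int32(math.Ceil(base * scaleMetric / maxDesiredMessageAge))
}

func main() {
	// e.g. 25 workers, oldest message is 120s old, projected 480s until the
	// queue hits zero, and we want everything consumed within 300s -> 40 workers.
	fmt.Println(desiredReplicas(25, 120, 480, 300))
}
```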
This should result in replicas ramping up more steadily as a result of the spike but remaining scaled-up in order to clear the queue within the target time.
You'll probably notice that the final calculation is pretty much what HorizontalPodAutoscaler does. However, using HPA doesn't work here because it doesn't synchronise metrics with scaling events, resulting in massive over-scaling: it scales up and then scales again and again because the queue metrics don't update quickly enough.
My example is very simplistic, but I'm just trying to figure out if it's feasible in theory at least. Would love to hear some thoughts...
I understand your concern. Thanks for raising this here.
I see you have not tried out secondsToProcessOneJob. Using that will keep a minimum number of pods always running, based on the queue RPM. Please try it once; it may solve the issue, as the surge would not be that spiky after that.
Also, we try not to have such spiky load on the system: your pods might take it, but a 3rd-party system might go down :) But that is a separate discussion and not a concern for WPA.
We want to keep the metrics and flags to a minimum so that the whole system remains simple and easy to use, and we do not wish to complicate the system until it is really necessary.
Please try secondsToProcessOneJob and give feedback.
Also, you could control the scale down in this use case by using the maxDisruption spec. With maxDisruption you control how quickly or slowly you scale down; it lets you scale down in steps. Please try this as well.
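Roughly, the scale-down cap works like the sketch below (a simplification, not the exact controller code; percentage parsing and idle-worker handling are left out):

```go
package main

import (
	"fmt"
	"math"
)

// scaleDownTarget caps how far a single sync is allowed to scale down.
func scaleDownTarget(current, desired int32, maxDisruptionPct float64) int32 {
	if desired >= current {
		return desired // scaling up (or no change) is never limited
	}
	// At most this many workers may be removed in one sync.
	maxDisruptable := int32(math.Ceil(float64(current) * maxDisruptionPct / 100))
	if current-desired > maxDisruptable {
		return current - maxDisruptable
	}
	return desired
}

func main() {
	// With 25 running workers, a desired of 7 and maxDisruption: 10%,
	// only ceil(25*10%)=3 workers may be removed in this sync.
	fmt.Println(scaleDownTarget(25, 7, 10))  // 22
	fmt.Println(scaleDownTarget(25, 7, 100)) // 7
}
```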
Hi, thanks for the response and the suggestions. I've run a few simple tests based on these; the results are below:
Test 1: targetMessagesPerWorker: 40, maxDisruption: 100%, secondsToProcessOneJob: null (same as in my original post)
Test 2: targetMessagesPerWorker: 40, maxDisruption: 10%, secondsToProcessOneJob: null
Test 3: targetMessagesPerWorker: 40, maxDisruption: 100%, secondsToProcessOneJob: 2
Test 4: targetMessagesPerWorker: 40, maxDisruption: 100%, secondsToProcessOneJob: 10
These four tests are graphed below from left to right:
We see that the time to get from 1000 messages to 0 messages is as follows:
Test 1: 8m0s
Test 2: 7m40s
Test 3: 7m0s
Test 4: 8m0s
The graph of replicas is below:
As you can see, there is a difference when maxDisruption is set, but it is not huge. The behaviour when secondsToProcessOneJob is set is much the same time-wise, and the replica count is just plain weird: the replicas spike back up when the message count is near zero.
I increased verbosity to 4 on the controller while running these tests; here's a snippet from the period when the workers spike back up. It seems the controller believes messages are arriving on the queue??
```
I0821 16:06:17.546742 1 sqs.go:396] cpapi_orange_transport: messagesSentPerMinute=200
I0821 16:06:17.578893 1 sqs.go:415] cpapi_orange_transport: approxMessages=0
I0821 16:06:17.586859 1 sqs.go:436] approxMessagesNotVisible=1
I0821 16:06:17.587026 1 sqs.go:447] cpapi_orange_transport: approxMessagesNotVisible > 0, not scaling down
I0821 16:06:26.348490 1 sqs.go:436] approxMessagesNotVisible=0
I0821 16:06:36.224885 1 controller.go:618] cpapi_orange_transport min=0, max=100, targetBacklog=40
I0821 16:06:36.224905 1 controller.go:590] 10, workersBasedOnMessagesSent=34
I0821 16:06:36.224915 1 controller.go:641] cpapi_orange_transport qMsgs=1, qMsgsPerMin=200
I0821 16:06:36.224922 1 controller.go:643] cpapi_orange_transport secToProcessJob=10, maxDisruption=100%
I0821 16:06:36.224929 1 controller.go:645] cpapi_orange_transport current=34, idle=-1
I0821 16:06:36.225043 1 controller.go:647] cpapi_orange_transport minComputed=34, maxDisruptable=34
I0821 16:06:36.225050 1 controller.go:434] cpapi_orange_transport qMsgs: 1, desired: 34
I0821 16:06:36.225076 1 controller.go:761] orange-api/cpapi-transport: Updating wpa status
I0821 16:06:36.232991 1 controller.go:781] orange-api/cpapi-transport: Updated wpa status
I0821 16:06:36.233666 1 controller.go:618] cpapi_orange_transport min=0, max=100, targetBacklog=40
I0821 16:06:36.233678 1 controller.go:590] 10, workersBasedOnMessagesSent=34
I0821 16:06:36.233685 1 controller.go:641] cpapi_orange_transport qMsgs=1, qMsgsPerMin=200
I0821 16:06:36.233691 1 controller.go:643] cpapi_orange_transport secToProcessJob=10, maxDisruption=100%
I0821 16:06:36.233697 1 controller.go:645] cpapi_orange_transport current=34, idle=-1
I0821 16:06:36.233702 1 controller.go:647] cpapi_orange_transport minComputed=34, maxDisruptable=34
I0821 16:06:36.233707 1 controller.go:434] cpapi_orange_transport qMsgs: 1, desired: 34
I0821 16:06:36.233721 1 controller.go:758] orange-api/cpapi-transport: WPA status is already up to date
```
Hi @mf-lit,
Thank you for the comprehensive problem statement. One caveat with spiky traffic is that the calculations and the experiments that we discuss can result in weird outcomes because:
- The CloudWatch metrics are not always accurate. They even prefix some metrics with Approximate to emphasize this.
- The CloudWatch metrics are not realtime. They are delayed by at least a minute.
- The current calculations assume the pod boot up time to be negligible or 0.
I really like your idea of scaling based on the ApproximateAgeOfOldestMessage metric and the maxDesiredMessageAge config parameter. We already use the ApproximateAgeOfOldestMessage metric for our CloudWatch alarms but we haven't explored this possibility before.
The calculations and the config parameters in WPA are designed to complement each other and optimize primarily for 2 use-cases:
Case-1 - Long-running workers with low throughput: (like 2 minutes to process a job and having 0-10 jobs per minute)
Here, targetMessagesPerWorker is essential to scale up more effectively. Partial scaling down can be problematic here because graceful shutdown may require more time than the k8s configuration allows, so setting maxDisruption to 0 is better. Setting maxDisruption to 0 also ensures that clearing the backlog is prioritized over saving cost. (min=0 is also useful here for saving cost, assuming throughput is very low and pods boot up fast enough.)
Case-2 - Fast workers with high throughput: (like < 1 second to process a job and having > 1000 jobs per minute)
Here targetMessagesPerWorker doesn't work because the backlog is almost always at 0. secondsToProcessOneJob is useful here to prevent scale down and set the desired based on throughput averaged over the last 10 minutes. This also explains why you were seeing qMsgsPerMin=200 in the logs, because the value is averaged out.
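In numbers, the throughput-based value in your logs works out roughly like this (a simplified sketch of the idea, not the exact controller code):

```go
package main

import (
	"fmt"
	"math"
)

// workersForThroughput is the throughput-based sizing idea behind
// secondsToProcessOneJob: enough workers to keep up with the average arrival rate.
func workersForThroughput(msgsSentPerMinute, secondsToProcessOneJob float64) int32 {
	return int32(math.Ceil(msgsSentPerMinute * secondsToProcessOneJob / 60))
}

func main() {
	// Matches the logs above: qMsgsPerMin=200 (a 10-minute average) and
	// secToProcessJob=10 -> workersBasedOnMessagesSent=34.
	fmt.Println(workersForThroughput(200, 10)) // 34
}
```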
I think your use-case can be better mapped to Case-1 than to Case-2. If so, setting maxDisruption to 0 is recommended to ensure that the backlog is consumed as fast as possible and scaling down happens only when all workers are idle. However, if all the workers are never idle, then it can result in over-provisioned infrastructure.
If that doesn't work and we need to treat it as a new case (Case-3), then the solution that you proposed can be used to ensure the desired workers don't fall below the calculated value. We should however explore if this can cause some unexpected issues. One minor issue that I can think of is if the pod crashes and the job goes into messagesNotVisible for the rest of the visibility timeout, this can trigger unnecessary scale up because the ApproximateAgeOfOldestMessage metric will go up even though all visible messages are getting consumed immediately. Also, the AWS documentation for the ApproximateAgeOfOldestMessage metric warns about poison-pill messages and also jobs that are received more than 3 times.
maxDisruption=10% should have helped more than what is shown in the experiments, but reducing the short poll interval makes it less effective. Based on the logs, I think you have set it to 10 seconds instead of the default 20 seconds.
Hi @justjkk, thanks for taking the time to respond so thoroughly. Sorry it's taken a while to get back to you; I've been away.
> If that doesn't work and we need to treat it as a new case (Case-3), then the solution that you proposed can be used to ensure the desired workers don't fall below the calculated value. We should however explore if this can cause some unexpected issues. One minor issue that I can think of is if the pod crashes and the job goes into messagesNotVisible for the rest of the visibility timeout, this can trigger unnecessary scale up because the ApproximateAgeOfOldestMessage metric will go up even though all visible messages are getting consumed immediately. Also, the AWS documentation for the ApproximateAgeOfOldestMessage metric warns about poison-pill messages and also jobs that are received more than 3 times.
These are very good points. I guess some of these could be compensated for, but it would make the algorithm/logic used rather more complicated which may not be desirable.
> maxDisruption=10% should have helped more than what is shown in the experiments, but reducing the short poll interval makes it less effective. Based on the logs, I think you have set it to 10 seconds instead of the default 20 seconds.
Actually no, here is the complete command line I'm using:
```
/workerpodautoscaler run --resync-period=20 --wpa-threads=10 --aws-regions=eu-west-1 --sqs-short-poll-interval=20 --sqs-long-poll-interval=20 --k8s-api-qps=5.0 --k8s-api-burst=10 --wpa-default-max-disruption=100% --queue-services=sqs -v=4
```
Like you I was also surprised that maxDisruption=10% wasn't more effective. I've also tried 1% and that too did not improve things. This can be seen on the left of the graph below.
> I think your use-case can be better mapped to Case-1 than to Case-2. If so, setting maxDisruption to 0 is recommended to ensure that the backlog is consumed as fast as possible and scaling down happens only when all workers are idle. However, if all the workers are never idle, then it can result in over-provisioned infrastructure.
Out of curiosity I tried maxDisruption=0% despite it not being suitable for our production workloads (our queue rarely reaches zero). This is shown on the right of the graph. It certainly improved the time taken to clear the queue, but desiredReplicas never came down when the queue was down to zero??
In v1.6.0 we have released:
```
--scale-down-delay-after-last-scale-activity int   scale down delay after last scale up or down in seconds (default 600)
```
The resync period is not respected when updates to the status happen frequently due to changes in the status object. So we have added this cooldown to make scale down happen only at a 10-minute interval after the last scale activity, and not within 200ms as was happening before.
I am hoping that using this will solve the problem you mentioned above, as there would be more workers running and the queue would get cleared up sooner.
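For example, adding the new flag to the command line you shared earlier (600 seconds is the default; the value is only an illustration):

```
/workerpodautoscaler run --resync-period=20 --wpa-threads=10 --aws-regions=eu-west-1 --sqs-short-poll-interval=20 --sqs-long-poll-interval=20 --k8s-api-qps=5.0 --k8s-api-burst=10 --wpa-default-max-disruption=100% --queue-services=sqs --scale-down-delay-after-last-scale-activity=600 -v=4
```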