practo/k8s-worker-pod-autoscaler

Can one of `targetMessagesPerWorker` and `secondsToProcessOneJob` be deduced from the other?

sujanadiga opened this issue · 3 comments

As per the current implementation, targetMessagesPerWorker is a mandatory option and secondsToProcessOneJob is an optional one with default=0.

As per my understanding, secondsToProcessOneJob is used to calculate the minimum number of workers, while targetMessagesPerWorker is used to calculate the usage ratio and hence the desired number of workers.

Since the two values are tuned separately, they can fall out of sync, which can result in undesired scaling behaviour.
Take this example from one of the test cases:

// TestScaleUpWhenCalculatedMinIsGreaterThanMax
// when calculated min is greater than max
func TestScaleUpWhenCalculatedMinIsGreaterThanMax(t *testing.T) {
	queueName := "otpsms"
	queueMessages := int32(1)
	messagesSentPerMinute := float64(2136.6)
	secondsToProcessOneJob := float64(10)
	targetMessagesPerWorker := int32(2500)
	currentWorkers := int32(10)
	idleWorkers := int32(0)
	minWorkers := int32(2)
	maxWorkers := int32(20)
	maxDisruption := "0%"
	expectedDesired := int32(20)
	desiredWorkers := controller.GetDesiredWorkers(
		queueName,
		queueMessages,
		messagesSentPerMinute,
		secondsToProcessOneJob,
		targetMessagesPerWorker,
		currentWorkers,
		idleWorkers,
		minWorkers,
		maxWorkers,
		&maxDisruption,
	)
	if desiredWorkers != expectedDesired {
		t.Errorf("expected-desired=%v, got-desired=%v\n", expectedDesired,
			desiredWorkers)
	}
}

As per this snapshot, a worker takes 10s to process a single job, and approximately 2136.6 messages were sent in the last minute. This would put the minimum number of workers needed at 21366, but the desired number of workers is calculated as 1, since one worker can handle 2500 messages in a minute (targetMessagesPerWorker is 2500, but there is only one message in the queue).

Outcome:
Min: 21366
Max: 20
Desired calculated: 1
Desired: 20 (capped to max workers)
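The capping in this outcome can be illustrated with a minimal Go sketch. It assumes the backlog-based estimate is first raised to the calculated minimum and then capped at the maximum; `clampDesired` is a hypothetical helper written for this illustration and is not taken from the controller's code.

```go
package main

import "fmt"

// clampDesired is a hypothetical helper, not the controller's code.
// Assumption: the backlog-based estimate is raised to the calculated
// minimum first, and the result is then capped at maxWorkers.
func clampDesired(desiredCalculated, calculatedMin, maxWorkers int32) int32 {
	desired := desiredCalculated
	if desired < calculatedMin {
		desired = calculatedMin
	}
	if desired > maxWorkers {
		desired = maxWorkers
	}
	return desired
}

func main() {
	// Values from the outcome above: desired 1, min 21366, max 20.
	fmt.Println(clampDesired(1, 21366, 20)) // prints 20
}
```

Under this assumption, any calculated minimum above the configured maximum pins the deployment at the maximum, no matter how small the backlog is.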

Yes, allowing both targetMessagesPerWorker and secondsToProcessOneJob to be configured separately might help in cases where we want to clear the backlog as fast as possible (@justjkk's comment); however, that is true only for 50% of cases, where

targetMessagesPerWorker < (60 / secondsToProcessOneJob)

My question is, can one of targetMessagesPerWorker and secondsToProcessOneJob be deduced from the other using the formula

secondsToProcessOneJob = 60 / targetMessagesPerWorker

to avoid scaling issues due to misconfiguration?

  • targetMessagesPerWorker is useful for long-running workers that take seconds or even minutes to process a job, where jobs per minute is almost 0. In this case, secondsToProcessOneJob is ineffective because the RPM (averaged over 10 minutes) is almost 0.

  • secondsToProcessOneJob is useful for fast workers that consume jobs so quickly that the number of queued jobs is always 0 while jobs per minute is high. In this case, targetMessagesPerWorker is ineffective because the number of queued jobs is 0.
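The two regimes above can be sketched in Go. This is only an illustration under stated assumptions: `backlogWorkers` and `rateWorkers` are hypothetical helpers, and their formulas (queued jobs divided by the target ratio, and arrival rate times processing time divided by 60) are one plausible reading of the two parameters, not the controller's actual implementation.

```go
package main

import (
	"fmt"
	"math"
)

// backlogWorkers is a hypothetical estimate driven by targetMessagesPerWorker:
// enough workers to keep queued jobs per worker at the target ratio.
func backlogWorkers(queuedJobs, targetMessagesPerWorker int32) int32 {
	return int32(math.Ceil(float64(queuedJobs) / float64(targetMessagesPerWorker)))
}

// rateWorkers is a hypothetical estimate driven by secondsToProcessOneJob:
// enough workers to keep up with the arrival rate (jobs per minute).
func rateWorkers(jobsPerMinute, secondsToProcessOneJob float64) int32 {
	return int32(math.Ceil(jobsPerMinute * secondsToProcessOneJob / 60))
}

func main() {
	// Long-running workers: 40 queued jobs of 300s each, almost no arrivals.
	// The backlog estimate dominates; the rate estimate is close to useless.
	fmt.Println(backlogWorkers(40, 2), rateWorkers(0.1, 300)) // prints 20 1

	// Fast workers: empty queue, 1200 jobs per minute at 0.5s each.
	// The backlog estimate is 0; only the rate estimate is informative.
	fmt.Println(backlogWorkers(0, 2), rateWorkers(1200, 0.5)) // prints 0 10
}
```

In each scenario only one of the two estimates carries a signal, which is why both parameters remain configurable.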

#105 (currently WIP) will update the documentation of these parameters and also provide example scenarios.

Regarding using a single configurable value and then using it to calculate the other with the below formula:

secondsToProcessOneJob = 60 / targetMessagesPerWorker

The above conversion assumes 1 minute as the desired time to process the backlog. Since targetMessagesPerWorker is not a per-minute value, the above formula cannot be used. targetMessagesPerWorker is just the target ratio that the WPA controller tries to maintain between the queued jobs (available + in-process jobs) and the current workers.

So, there are three variables: two of them need to be specified, and the third can be calculated from those. The actual formula can be:

secondsToProcessOneJob = desiredSecondsToClearQueuedMessages / targetMessagesPerWorker
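As a small worked example of this relationship, the Go sketch below derives the third value from the other two. The function is a hypothetical helper named after the parameter it computes, and desiredSecondsToClearQueuedMessages is treated here as a plain input rather than an existing WPA option.

```go
package main

import "fmt"

// secondsToProcessOneJob applies the formula above. It is a hypothetical
// helper for illustration; targetMessagesPerWorker is assumed to be
// non-zero, since it is a mandatory option.
func secondsToProcessOneJob(desiredSecondsToClearQueuedMessages float64, targetMessagesPerWorker int32) float64 {
	return desiredSecondsToClearQueuedMessages / float64(targetMessagesPerWorker)
}

func main() {
	// With a 1-minute clearing target this reduces to the 60 / target
	// formula from the question.
	fmt.Println(secondsToProcessOneJob(60, 2500))

	// A 10-minute clearing target with 20 queued messages per worker
	// allows 30 seconds per job.
	fmt.Println(secondsToProcessOneJob(600, 20)) // prints 30
}
```

The first call shows that the formula in the question is the special case where the desired clearing time is fixed at 60 seconds.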

Got it. Thank you @justjkk for clarifying.
Thank you @alok87 for sharing supporting test cases.