laravel/ideas

Laravel Job Batching's atomic locks limit scaling

bkuhl opened this issue · 9 comments

bkuhl commented

Problem

DatabaseBatchRepository::decrementPendingJobs() and a few other methods in this repository select a record from the database, acquire a lock on the row, do some work in PHP, and then write the updated values back. The more workers there are processing a single batch, the more these locks block the queue from doing work and tie up resources.
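
For context, the pattern in question looks roughly like this (a simplified sketch of the repository's approach, not the exact framework code; $batchId and $jobId stand in for the real arguments):

```php
use Illuminate\Support\Facades\DB;

// Read-lock-modify-write: the batch row stays locked for the whole transaction.
DB::transaction(function () use ($batchId, $jobId) {
    // SELECT * FROM `job_batches` WHERE `id` = ? LIMIT 1 FOR UPDATE
    $batch = DB::table('job_batches')
        ->where('id', $batchId)
        ->lockForUpdate()
        ->first();

    // Recompute the counters and the failed_job_ids JSON array in PHP...
    $failedJobIds = array_values(array_diff(
        (array) json_decode($batch->failed_job_ids, true),
        [$jobId]
    ));

    // ...then write everything back. Every other worker touching the same
    // batch row waits on the lock for the duration of this callback.
    DB::table('job_batches')
        ->where('id', $batchId)
        ->update([
            'pending_jobs'   => $batch->pending_jobs - 1,
            'failed_job_ids' => json_encode($failedJobIds),
        ]);
});
```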

Feature Request

The ability to configure queue batches to not record the IDs of failed jobs and instead only keep track of simple counts, so that a single UPDATE query is sufficient when updating pending_jobs, failed_jobs, etc. on the batch's database record.

My Experience

Recently I began digging into high CPU usage on my application's database. We're deployed using Docker and run 6 Horizon containers, which translates to 48 processes operating on different queues throughout the app, with a cap of 42 processes on any one queue. When I enabled AWS RDS Performance Insights, I was able to see what was taking so long. It looks as though the culprit is this simple query:

SELECT * FROM `job_batches` WHERE `id` = "91b0f7a8-0693-4cb4-8647-e7509318fbbc" LIMIT 1 FOR UPDATE

In this graph, you can see from the blue area that this query is having the largest impact.

(Screenshot: AWS RDS Performance Insights graph, April 28, 2021)

Using Laravel Horizon, we can watch the Jobs Per Minute metric dwindle in these scenarios as the locks block workers from performing their tasks, while the number of jobs in the queue stays in the tens of thousands.

To resolve this issue, we stopped using Laravel batches.

Interesting, it does seem that decrementPendingJobs really has no need to update failed_job_ids. So a performance improvement definitely looks possible.

bkuhl commented

It does need to update it, because it's effectively removing the pending job from the failed list, if it was there. The underlying performance limitations come from maintaining the list of job IDs in the batch data.

Why do you think the performance is impacted by storing the failed job IDs?

bkuhl commented

If the failed job IDs weren't stored, the database record could be updated with a simple in-place increment such as x = (x + 7), which is a much faster update than pulling the value out, performing a diff on the array to ensure the failed job IDs are unique, and writing the new value back to the database. When there are a lot of workers trying to do that at the same time, the work backs up. I'm not sure if MySQL's JSON_ARRAY_APPEND would be any more performant.
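
For illustration, a counter-only update could be a single atomic statement through the query builder (a rough sketch, not what the framework does today):

```php
use Illuminate\Support\Facades\DB;

// One atomic statement per event, no prior SELECT ... FOR UPDATE and no
// JSON round-trip through PHP.
// UPDATE job_batches SET pending_jobs = pending_jobs - 1 WHERE id = ?
DB::table('job_batches')
    ->where('id', $batchId)
    ->update(['pending_jobs' => DB::raw('pending_jobs - 1')]);

// UPDATE job_batches SET failed_jobs = failed_jobs + 1 WHERE id = ?
DB::table('job_batches')
    ->where('id', $batchId)
    ->update(['failed_jobs' => DB::raw('failed_jobs + 1')]);
```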

We need the lock in all cases to ensure the operation is atomic.

bkuhl commented

Yep, and a simple UPDATE ... SET pending_jobs = (pending_jobs - 1), failed_jobs = (failed_jobs -|+ X) WHERE id = ... would suffice, if we weren't trying to track the ID of each failed job.

bkuhl commented

Performing updates using a primary key is very common, so I'm operating under the assumption that isn't the issue. SELECT ... FOR UPDATE is a bit less common, and bypassing that functionality in my case resolved the CPU issues. So I'm assuming that technique is the performance bottleneck.

You can override the Illuminate\Bus\BatchRepository in your app and disable updating the failed job IDs list if you want. We can't simply remove this functionality, and we haven't received any similar complaints about it, even from high-load applications.
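
Something along these lines could work as a starting point (sketch only; the class name is made up, Batch::failedJobIds stops being maintained, and without the lock two workers may briefly observe the same counts):

```php
<?php

namespace App\Bus;

use Illuminate\Bus\DatabaseBatchRepository;
use Illuminate\Bus\UpdatedBatchJobCounts;

// Hypothetical repository that keeps only the counters up to date with
// single atomic UPDATEs and skips rewriting the failed_job_ids JSON list.
class CountOnlyBatchRepository extends DatabaseBatchRepository
{
    public function decrementPendingJobs(string $batchId, string $jobId)
    {
        // UPDATE job_batches SET pending_jobs = pending_jobs - 1 WHERE id = ?
        $this->connection->table($this->table)
            ->where('id', $batchId)
            ->decrement('pending_jobs');

        return $this->currentCounts($batchId);
    }

    public function incrementFailedJobs(string $batchId, string $jobId)
    {
        // UPDATE job_batches SET failed_jobs = failed_jobs + 1 WHERE id = ?
        $this->connection->table($this->table)
            ->where('id', $batchId)
            ->increment('failed_jobs');

        return $this->currentCounts($batchId);
    }

    protected function currentCounts(string $batchId): UpdatedBatchJobCounts
    {
        // Plain re-read without FOR UPDATE; racy by design, which is the
        // tradeoff of dropping the lock.
        $batch = $this->connection->table($this->table)
            ->where('id', $batchId)
            ->first();

        return new UpdatedBatchJobCounts($batch->pending_jobs, $batch->failed_jobs);
    }
}
```

Registering it is then a matter of re-binding Illuminate\Bus\BatchRepository to this class in one of your service providers (the exact registration details depend on your Laravel version).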

@bkuhl Example implementation using JSON_ARRAY_APPEND and JSON_REMOVE.

https://gist.github.com/fedeisas/a81a772b2f8fc63d52f8dde701fbf102