Laravel Job Batching's atomic locks limit scaling
bkuhl opened this issue · 9 comments
### Problem
`DatabaseBatchRepository::decrementPendingJobs()` and a few other methods in this repository select a record from the database, take a lock on the row, do some work, and update the values in the database. The more workers there are processing a single batch, the more these locks block the queue from doing work and tie up resources.
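The pattern looks roughly like this (a simplified sketch of the repository's behavior, not the framework's exact code):

```php
use Illuminate\Support\Facades\DB;

// Simplified sketch of the read-lock-modify-write pattern described above;
// not the framework's exact code.
function decrementPendingJobs(string $batchId, string $jobId): void
{
    DB::transaction(function () use ($batchId, $jobId) {
        // SELECT ... FOR UPDATE: every concurrent worker finishing a job
        // in this batch blocks here until the current lock holder commits.
        $batch = DB::table('job_batches')
            ->where('id', $batchId)
            ->lockForUpdate()
            ->first();

        // The JSON ID list forces a read-modify-write: the new value
        // depends on the old one, so the row must stay locked throughout.
        $failedJobIds = json_decode($batch->failed_job_ids, true) ?? [];
        $failedJobIds = array_values(array_diff($failedJobIds, [$jobId]));

        DB::table('job_batches')
            ->where('id', $batchId)
            ->update([
                'pending_jobs'   => $batch->pending_jobs - 1,
                'failed_job_ids' => json_encode($failedJobIds),
            ]);
    });
}
```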
### Feature Request
Ability to configure queue batches to not record the IDs of failed jobs and instead keep only simple counts, so that a single `UPDATE` query is sufficient when updating `pending_jobs`, `failed_jobs`, etc. on the batch's database record.
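Under that scheme the whole bookkeeping step could collapse into one statement, along these lines (a hypothetical sketch; `$jobFailed` and `$batchId` are assumed to come from the surrounding worker code):

```php
use Illuminate\Support\Facades\DB;

// Hypothetical counts-only update: one atomic statement, no
// SELECT ... FOR UPDATE and no JSON bookkeeping.
DB::update(
    'UPDATE job_batches
        SET pending_jobs = pending_jobs - 1,
            failed_jobs  = failed_jobs + ?
      WHERE id = ?',
    [$jobFailed ? 1 : 0, $batchId]
);
```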
### My Experience
Recently I began digging into high CPU issues on my application's database. We're deployed using Docker and operate 6 Horizon containers, which translates to 48 processes operating on different queues throughout the app, with a cap of 42 processes on any single queue. Enabling AWS RDS Performance Insights gave me some insight into what was taking so long. It looks as though the culprit is this simple query:
```sql
SELECT * FROM `job_batches` WHERE `id` = "91b0f7a8-0693-4cb4-8647-e7509318fbbc" LIMIT 1 FOR UPDATE
```
In the Performance Insights graph, the blue area shows that this query has the largest impact.
Using Laravel Horizon, we can watch in these scenarios as the `Jobs Per Minute` metric dwindles while these locks block workers from performing their tasks, even though the number of jobs in the queue stays in the tens of thousands.
To resolve this issue, we stopped using Laravel batches.
Interesting, it does seem that `decrementPendingJobs` really has no need to update `failed_job_ids`. So it definitely looks like a performance improvement is possible.
It does need to update it, because it's effectively removing the pending job from the failed list, if it was there. The underlying performance limitations are created by maintaining the list of job IDs in the batch data.
Why do you think the performance is impacted by storing the failed job IDs?
If the failed job IDs weren't stored, the database record could be updated using a simple incrementing expression such as `x = (x + 7)`, which is a much faster update than pulling the value out, performing a diff on the array to ensure the failed job IDs are unique, and writing the new value back to the database. When there are a lot of workers trying to do that at the same time, it creates a backup. I'm not sure if MySQL's `JSON_ARRAY_APPEND` would be any more performant.
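For reference, an in-place append might look something like this (an untested sketch; the `JSON_CONTAINS` guard is my stand-in for the uniqueness check the array diff provides today):

```php
use Illuminate\Support\Facades\DB;

// Untested sketch: append the failed job ID in-place on the server.
// The JSON_CONTAINS guard replaces the PHP-side array diff that
// currently ensures uniqueness. "\$" is the literal JSON path '$'
// (the root array).
DB::update(
    "UPDATE job_batches
        SET failed_jobs = failed_jobs + 1,
            failed_job_ids = IF(
                JSON_CONTAINS(failed_job_ids, JSON_QUOTE(?)),
                failed_job_ids,
                JSON_ARRAY_APPEND(failed_job_ids, '\$', ?)
            )
      WHERE id = ?",
    [$jobId, $jobId, $batchId]
);
```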
We need the lock in all cases to ensure the operation is atomic.
Yep, and a simple `UPDATE ... SET pending_jobs = (pending_jobs - 1), failed_jobs = (failed_jobs -|+ X) WHERE id = ...` would suffice if we weren't trying to track the ID of each failed job.
Performing updates using a primary key is very common, so I'm operating under the assumption that isn't the issue. `SELECT ... FOR UPDATE` is a bit less common, and bypassing that functionality in my case resolved the CPU issues, so I'm assuming that technique is the performance bottleneck.
You can override the `Illuminate\Bus\BatchRepository` in your app and disable updating the failed job IDs list if you want. We can't simply remove this functionality, and we haven't received any similar complaints about it, even from high-load applications.
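A minimal sketch of that override, assuming a hypothetical `CountOnlyBatchRepository` that you would write yourself (for example, extending `DatabaseBatchRepository` and skipping the `failed_job_ids` bookkeeping):

```php
namespace App\Providers;

use App\Bus\CountOnlyBatchRepository; // hypothetical repository you implement
use Illuminate\Bus\BatchRepository;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        // Route batch bookkeeping through the custom repository instead
        // of the default DatabaseBatchRepository. Depending on the Laravel
        // version, you may need to register this binding after the
        // framework's (deferred) bus provider has run.
        $this->app->bind(BatchRepository::class, CountOnlyBatchRepository::class);
    }
}
```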
@bkuhl Example implementation using `JSON_ARRAY_APPEND` and `JSON_REMOVE`: https://gist.github.com/fedeisas/a81a772b2f8fc63d52f8dde701fbf102