Large retry queue causes reaper to run too slowly
dedman opened this issue · 6 comments
Describe the bug
We just upgraded from sidekiq-unique-jobs 5.0.11 to 7.1.29, and when we deployed to production we noticed a lot of unique jobs with orphaned locks, which stopped some of our important sidekiq cron jobs from running. After digging into how the ruby reaper works, I think the issue is our large retry queue, which usually has around 3k jobs in it. I think the reaper was just continually timing out after 10 seconds and never actually getting to remove any orphaned locks.
I eventually narrowed the issue down to the zscan with a count=1, which is done on this line. That call is taking ~2 seconds to execute each time, so even if we set a long timeout on the reaper it still isn't anywhere near fast enough to be effective in our situation.
I've tested locally, and if I change the zscan count to 100 or 1000 the performance of the reaper is much better. I'm no expert in Redis, and have only just read the docs about zscan, so I'm not sure if there is a specific reason the zscan count is set to 1? Would you be open to a PR that either removes the count (I think Redis would default to 10 in that case) or makes the count configurable?
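For context, here is a minimal sketch of what I mean by a ZSCAN-based membership check with a configurable COUNT hint. This is not the gem's actual code; the `digest_in_sorted_set?` helper, the `SCAN_COUNT` constant, and the match pattern are just illustrative assumptions:

```ruby
require "redis"

# Hypothetical default; a larger COUNT hint means fewer round trips on big sorted sets.
SCAN_COUNT = 1_000

def digest_in_sorted_set?(redis, key, digest, count: SCAN_COUNT)
  cursor = "0"
  loop do
    # ZSCAN returns [next_cursor, [[member, score], ...]]; COUNT is only a hint.
    cursor, entries = redis.zscan(key, cursor, match: "*#{digest}*", count: count)
    return true if entries.any?
    break if cursor == "0"
  end
  false
end

# Usage: does any entry in the "retry" sorted set mention this lock digest?
# redis = Redis.new
# digest_in_sorted_set?(redis, "retry", "uniquejobs:abc123")
```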
Expected behavior
The reaper should work even when there is a large amount of jobs in the retry queue.
Current behavior
The reaper times out when there is a large retry queue and barely checks for any orphans.
Additional context
Thank you for such an awesome gem, and the amount of work you have put into it over the years :)
I suspect that if the zscan count were increased then this sort of thing might not be needed, or the MAX_QUEUE_LENGTH could be greatly increased.
Initially I thought this might have been because we were running sidekiq 6 and redis 4, but I also tested on sidekiq 7 and redis 7 and sidekiq-unique-jobs 8.0.1 and saw the same issue.
We are also in the process of fixing our workers so that we don't continually have so many retries in the queue. But even once we fix that, a large number of retries building up due to a transient issue with a worker shouldn't really mean that other jobs don't run because their locks are orphaned.
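For anyone else hitting this in the meantime, the reaper can at least be given more time via the gem's configuration. The option names below are as I understand them and the values are purely illustrative; note this only extends the timeout, it doesn't change the zscan count itself:

```ruby
# config/initializers/sidekiq_unique_jobs.rb
SidekiqUniqueJobs.configure do |config|
  config.reaper          = :ruby  # which reaper implementation to run
  config.reaper_count    = 1_000  # max orphaned locks to reap per run
  config.reaper_interval = 600    # seconds between reaper runs
  config.reaper_timeout  = 150    # seconds before a reaper run is stopped (default is much lower)
end
```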
The original line of thinking was that we only need one item, but if performance suffers because of that then I am more than happy to increase it. We could certainly do more.
I guess Redis has to do more iterations, and the testing perhaps becomes a little more complex. It would be cool to have a real-world test for the performance of this, since we have had some issues around reaper performance.
If I recall correctly I have some performance related tests already.
Hi @dedman,
I have tried to run some performance tests on the line of code you suggested.
With 1_000_000 (1 million) rows in the retry queue, I get this:
```
1) SidekiqUniqueJobs::Orphans::RubyReaper#in_sorted_set? when retried with the job in retry is expected to perform under 2 ms
   Failure/Error: it { expect { service.send(:in_sorted_set?, "retry", digest) }.to perform_under(2).ms }
     expected block to perform under 2 ms, but performed above 4.26 ms (± 6.11 ms)
```
The difference between setting the count to 1 or 1000 is hardly noticeable. Admittedly this isn't the real world, so it might not reflect real-world results.
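For the curious, the benchmark looked roughly like the following. This is a simplified sketch rather than the exact spec in the repo: the constructor argument, the payload shape used to seed the retry queue, and the rspec-benchmark `perform_under` matcher setup are assumptions.

```ruby
require "json"
require "redis"

RSpec.describe SidekiqUniqueJobs::Orphans::RubyReaper do
  let(:redis)   { Redis.new }
  let(:service) { described_class.new(redis) }
  let(:digest)  { "uniquejobs:abc123" }

  before do
    # Seed 1 million unrelated retry entries in pipelined batches, then one
    # entry whose payload contains the digest we will look for.
    1_000_000.times.each_slice(10_000) do |batch|
      redis.pipelined do |pipeline|
        batch.each do |i|
          pipeline.zadd("retry", i, { "jid" => i.to_s, "lock_digest" => "uniquejobs:other-#{i}" }.to_json)
        end
      end
    end
    redis.zadd("retry", 0, { "jid" => "target", "lock_digest" => digest }.to_json)
  end

  context "when retried with the job in retry" do
    it { expect { service.send(:in_sorted_set?, "retry", digest) }.to perform_under(2).ms }
  end
end
```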
I think we're hitting this problem too at the moment
Sorry I didn't follow up here earlier. I chatted with my management about this issue earlier in the year, and we decided to wait until we were ready to use Sidekiq Enterprise, which has a unique jobs feature.
For the last 6 months we kept an eye on the retry queue and tried to keep it under 200-300 jobs. We also increased the timeout on the reaper, which lessened the problem a little. We still had cron jobs fail to run roughly once a month because of this issue, but we generally noticed early enough to stop it from having any big impact.
We were already thinking about moving to Sidekiq Enterprise, so when this issue came up it was the final confirmation it would make sense for us to move over to Enterprise.