mhenrixon/sidekiq-unique-jobs

Read Timeouts and high CPU spike after upgrading from 8.0.5 to 8.0.10

mjooms opened this issue · 9 comments

Describe the bug
Experienced read timeouts > 5 sec after upgrading from 8.0.5 to 8.0.10, with Redis spiking to high CPU and causing downtime in our prod environment. The slow log showed many ZRANGE commands.

Expected behavior
Since this is a minor upgrade I'd have expected everything to keep functioning as usual, perhaps with better performance but not worse :)

Current behavior
Performance degraded, leading to downtime in our setup. The Redis server became so busy that new jobs couldn't be added anymore due to connection and read timeouts.

Worker class

Worker Flavor 1 (to queue a job for a user in either 24h, 72h or 168h, keeping the current scheduled job if one already exists for the user):
    sidekiq_options queue: "default",
                    lock: :until_and_while_executing,
                    on_conflict: { client: :log, server: :reject },
                    lock_ttl: 0,
                    retry: false

called with:
WorkerFlavor1.perform_in(24.hours, user_id)
                    
Worker Flavor 2 (to queue a job for a user in either 24h, 72h, or 168h, replacing an existing scheduled job if one already exists for the user): 
    sidekiq_options queue: "low_prio",
                    lock: :until_executed,
                    on_conflict: :replace,
                    lock_ttl: 0,
                    retry: false

called with:
WorkerFlavor2.perform_in(24.hours, user_id)
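
For completeness, the full worker definition boils down to something like this (the perform body is a simplified placeholder; we use Sidekiq::Job from Sidekiq 7):

    class WorkerFlavor2
      include Sidekiq::Job

      sidekiq_options queue: "low_prio",
                      lock: :until_executed,
                      on_conflict: :replace,
                      lock_ttl: 0,
                      retry: false

      def perform(user_id)
        # actual per-user work goes here (placeholder)
      end
    end

    # scheduled from application code:
    WorkerFlavor2.perform_in(24.hours, user_id)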

Additional context
We have on average around 121k of these scheduled jobs over 24h. This worked fine on 8.0.5, but after upgrading to 8.0.10 the Redis server became unresponsive due to high load.

I noticed you and @ezekg have recently worked on a performance PR (#835) that might solve the problem, but I wasn't really sure whether it addresses the same issue. Also, I wasn't able to test 8.0.11 because it hasn't been released yet (it is in the main branch but not yet on RubyGems, nor as a tag).

I don't think v8.0.11 exists; I'm not sure what is going on with the versioning in the master branch. Regardless, the performance work was implemented in v8.0.10 (AFAIK), but perhaps there's a bug that didn't surface in our tests or in my production environment, or a breaking change with regard to upgrading that we didn't catch.

Is there any way you can share a snippet of those ZRANGE logs? I want to see what args ZRANGE is being called with, since we now optimize the call when a score is available.

If possible, can you write a performance test like this one that fails?

You can run performance tests like this:

bin/rspec --require spec_helper --tag perf

You're right about the master branch: I noticed the commit for version 8.0.11, but it comes after the 8.0.10 tag. So v8.0.10 includes the performance fixes, which might actually be what's causing my issue then...

Unfortunately the slow log (as configured by AWS) does not show the arguments, but luckily it does show the argument count, which is 3.
Example (we got a lot of those after the deploy):

{"CacheClusterId":"wbw-production-sidekiq-002","CacheNodeId":"0001","Id":1462864,"Timestamp":1710799484,"Duration (us)":10508,"Command":"ZRANGE ... (3 more arguments)","ClientAddress":"?:0"}

So, reading the new code, there is no score available from ZSCORE, and it falls back to the old behaviour.
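
If I understand the change correctly, the difference roughly comes down to this (simplified in redis-rb terms, not the gem's actual Lua; key and member names are placeholders):

    # Simplified illustration in redis-rb terms; not the gem's actual Lua script.
    require "redis"

    redis  = Redis.new
    member = "serialized-job-payload" # placeholder for whatever the schedule set stores

    # Fallback path: a plain 3-argument ZRANGE walks the whole sorted set,
    # which is O(N) over our ~121k scheduled jobs.
    redis.zrange("schedule", 0, -1)

    # Fast path from #835: when the member's score is known via ZSCORE,
    # the scan can be narrowed to just that score.
    score = redis.zscore("schedule", member)
    redis.zrangebyscore("schedule", score, score) if score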

So it might indeed be an issue with upgrading?

Or could it be that the behaviour of the code in question has changed slightly? At first glance the implementations look the same; however (and keep in mind I'm not too familiar with Lua), I noticed that the old implementation used a break where the new one uses a return, inside the nested for loop.

The return statement not only breaks out of the inner for loop but exits the entire function, whereas the old break only leaves the inner for loop, so the while loop continues and possibly ZREMs even more items. The result of the old implementation was then the last removed item.

I need to dive into what exactly this function's responsibility is, but could it be that there is more than one item that needs to be ZREMed?
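
To make the difference concrete, here is the control flow expressed in Ruby-ish pseudocode (not the actual Lua; next_batch, match? and zrem are hypothetical stand-ins):

    # Ruby-ish sketch of the control-flow difference; next_batch, match? and zrem
    # are stand-ins for what the Lua script actually does.
    def old_behaviour
      result = nil
      while (batch = next_batch)              # outer while keeps fetching/scanning
        batch.each do |item|
          if match?(item)
            result = zrem(item)
            break                             # leaves only the inner loop; the while continues
          end
        end
      end
      result                                  # ends up being the last item removed
    end

    def new_behaviour
      while (batch = next_batch)
        batch.each do |item|
          return zrem(item) if match?(item)   # exits the whole function on the first hit
        end
      end
    end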

Could it be that your Lua scripts are old, @mjooms?

The gem should update them but sometimes shit happens 🤷‍♂️.

That's my initial thought. @ezekg's change is most definitely valid, but that was for replacing or deleting a digest. Not sure if that is your scenario; do you use the replace strategy?

Yes, I do have a replace strategy:

lock: :until_executed,
on_conflict: :replace,
lock_ttl: 0,

I perform a rolling deploy, so the two versions were running simultaneously for a moment. However, it did not recover even after only the new version was running.

I rolled back to 8.0.5 and the issue went away.

I'm hesitant to try the upgrade once more in production. The problem does not occur on our staging environment, so my first thought was that, under load, the new version simply uses a lot more CPU on the Redis server than the old one did.

However, @mhenrixon's remark makes sense too. Since I perform a rolling deploy, v8.0.5 and v8.0.10 run simultaneously for a while. What would happen in that case? Would v8.0.5 end up calling the Lua scripts of v8.0.10 on the Redis server? Would that be problematic? Or is there perhaps a migration issue?

@mjooms any update here? Were you able to write a failing performance test that reproduces the behavior? I'm curious what is going on.

I have written a performance test that simulates the behaviour by using the same worker config and adding 100k of them to the queue, but I wasn't able to identify any issues. It's hard to write a test, though, since I don't really know what exactly caused the CPU on the Redis server to spike in the first place...
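
Roughly along these lines (heavily simplified; it assumes a real Redis connection rather than Sidekiq's fake testing mode, ActiveSupport's duration helpers, and an initially empty scheduled set):

    # Rough sketch of the kind of perf spec I tried; names and setup are approximate.
    require "sidekiq/api"

    RSpec.describe WorkerFlavor2, :perf do
      it "schedules 100k jobs without hammering Redis" do
        100_000.times do |i|
          described_class.perform_in(24.hours, i) # i stands in for a user_id
        end

        # assumes the scheduled set starts empty
        expect(Sidekiq::ScheduledSet.new.size).to eq(100_000)
      end
    end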

I haven't been able to reproduce it on our staging environment yet either.

I don't really want to test it out on production once more, but I guess that's the only way to rule out what you suggested... that the Lua scripts were outdated 😬

Do you have any tips on how I can debug this further?

@mjooms I'm not sure running this in production again is necessary. I'm a little confused as to how v8.0.5 didn't have the buggy behavior that v8.0.10 supposedly did, because all v8.0.10 did was add an additional fast-track. In the worst case, it should behave exactly like v8.0.5. So if it failed in v8.0.10 but NOT v8.0.5, then that signals something in v8.0.10 is backwards incompatible or broken for your unique workload. But without a failing test, I'm not sure what…

Could you try upgrading to v8.0.9 and see if that exhibits the same behavior?

Chiming in here to maybe help get eyes on this — I can confirm there still seems to be a performance issue here. We experienced it again, which caused another production outage. Ultimately, we decided to remove our dependency on SUJ since we were no longer using it for anything mission-critical. Unfortunately, that also means we could not dig deeper into the performance issues. It was too risky to debug in production. As far as I can tell, for some reason or another, my original performance patch isn't always working. I spent a few hours, but I couldn't determine why that's the case so I gave up.

I wish there was a happier ending here, but alas. Thanks @mhenrixon for all you've done. Sorry we can't be of more help moving forward. I wish the project good luck. I hope somebody can get to the bottom of this!