mhenrixon/sidekiq-unique-jobs

until_and_while_executing with separate server and client lock_args broken since 8.0.8

Roguelazer opened this issue · 7 comments

Describe the bug

After upgrading from 8.0.7 to 8.0.10, jobs with until_and_while_executing that pass a custom lock_args_method which behaves differently on the server and the client (as shown in the README) no longer work. Based on bisection, this appears to have been broken by 63e9431 (the parent commit, 7d5e40a, still works).

Expected behavior
Given a worker as shown below, run the following client code:

1.upto(2)
  UniqueBugWorker.perform_async(id)
  UniqueBugWorker.perform_async(id)
  UniqueBugWorker.perform_async(id)
  UniqueBugWorker.perform_async(id, "foo")
  UniqueBugWorker.perform_async(id, "foo", "bar")
end

We expect to see exactly six invocations.

Current behavior
Since 63e9431, we end up seeing at most two invocations (and sometimes only 1, which is even more worrying). It appears that the lock digest is being reused between the until and while executing phases.

For some reason, the conflicted jobs also aren't getting rescheduled; they just get dropped on the floor.

worse yet, we end up creating and leaking uniquejobs: locks with the correct digest, so after downgrading to 8.0.7, the jobs are locked out until the lock TTL expires.

Worker class

class UniqueBugWorker
  include Sidekiq::Worker
  sidekiq_options queue: "repro",
    lock: :until_and_while_executing,
    on_conflict: {server: :reschedule, client: :log},
    lock_args_method: ->(args) do
      if Sidekiq.server?
        [args.first]
      else
        args
      end
    end

  def perform(id, foo = nil, bar = nil)
    Rails.logger.info { "UniqueBugWorker: performing #{id} with #{foo} #{bar}" }
    sleep 1
  end
end

Additional context
Add any other context about the problem here.

We seem to have been running into same issues since upgrade but with until_executed (possibly with other locks as well but I only have confirmation about issues with until_executed)

@reneklacan do you also use custom lock args?

we are getting a similar issue sidekiq_options lock: :until_executed, on_conflict: :reschedule error => SystemStackError: stack level too deep from <internal:kernel>:185:in `loop

As mentioned above; This appears to be an issue related to until_executing and reschedule. This also affects until_and_while_executing as it contains the until. I think this is actually easier to produce with until-and-while. You can produce the error with the following;

class BrokenJob
  include Sidekiq::Job
  sidekiq_options lock: :until_and_while_executing, on_conflict: :reschedule
  def perform = sleep(10)
end

BrokenJob.perform_async
#=> "job_id"
BrokenJob.perform_async
# SystemStackError: stack level too deep

For :until_executing its easier to reproduce if you just don't run the sidekiq worker process at all (or while the queue is backed up.)

From what I can tell; If the job is enqueued it fails to get the lock (until). While rescheduling, it checks this lock again and... since its locked, tries to reschedule... What I think should happen here is if the job is being rescheduled (or.. scheduled in the future in general?), it should ignore checking the lock?

As an asside; Interestingly, sometimes I can get this to actually run two jobs at the same time, and some times the second invocation returns nil and the 3rd will produce a Stack error. So the locking in general is a bit racey to begin with?

That is not great if the lock is racey as the whole point of the lock is to prevent race condition

To extend...

I tried updating this to remove the lock while OnConflict::Reschedule is doing perform_in but... it really doesn't matter. The gem prioritizes the classes options, over the jobs options. So even if you .set(lock: nil) it purposefully ignore the lock setting, it still uses the lock setting from the class.