breamware/sidekiq-batch

Success Callback calling before all jobs are complete

Closed this issue · 11 comments

With more logging, I've been able to spot jobs executing after the on_success callback execution. I'm not sure what detail I can provider, but here we go. I am happy to provide more I'm just not sure what else would be helpful.

Redacted logs:

Nov 08 05:09:41pm info Starting batch b-5j1infmvnMQw
Nov 08 05:09:43pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:44pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:46pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:47pm info Running in batch b-5j1infmvnMQw
Nov 08 05:09:48pm info Completed batch b-5j1infmvnMQw
Nov 08 05:09:48pm info Successfully ran batch b-5j1infmvnMQw {"total":0,"failures":0,"pending":0,"created_at":null,"complete":false,"failure_info":[],"parent_bid":null}
Nov 08 05:10:00pm info Running  in batch b-5j1infmvnMQw
Nov 08 05:10:01pm info Running  in batch b-5j1infmvnMQw
Nov 08 05:10:04pm info Running  in batch b-5j1infmvnMQw

Thought it was strange the total field in the data was 0, testing this locally as well. The total is correct up until the on_success is called, it then goes to 0.

Code that creates the batch:

id = '1234' # fake example of parameters
batch = Sidekiq::Batch.new
batch.on(
  :success,
  Integration::MethodWorker,
  id: id
)
batch.on(:complete, MyWorker)
batch.callback_queue = 'queue'
batch

Workers enqueuing

batch.jobs
  MyWorker.perform_in(2.seconds)
  MyWorker.perform_in(3.seconds)
  MyWorker.perform_in(4.seconds)
  MyWorker.perform_in(5.seconds)
  MyWorker.perform_in(6.seconds)
  MyWorker.perform_in(7.seconds)
  MyWorker.perform_in(8.seconds) # there is an if statement on this (wouldn't think that matters)
en

Environment:

Ruby: 2.5.0
Rails: 5.1.4

@davidrichey Did you find a solution for this? It's my first time trying the sidekiq-batch gem, so I had a quick look at the open issues, and this sounds like a serious bug. I guess this only applies to scheduled jobs that run in the future, so maybe I'll be ok for now, because I don't need batches for scheduled jobs.

I have not unfortunately, we had to roll our own solution since this was not resolved.

Oh no, I just ran into this issue as well!

Fortunately I've added a lot of profiling data to my jobs and included timestamps, so I'm 100% sure that the on_success callback was fired too early, and before the job had finished.

I'm actually using database records to track all of my jobs as well. I was doing a sanity check in the on_success callback to ensure that the state had transitioned to processed, but it crashed because it was still pending.

For the job that was still pending and was supposed to be processed, I saved these timestamps:

"job_start_time": "14:10:47.002"
"job_end_time": "14:11:48.628"

I saw that my on_success callback updated the record with an error at 14:11:46, which is before the batch job had even started.

@managr - I was wondering if you are calling the on_success callback as soon as there are no jobs left in the queue? Or are you waiting for the jobs to finish before calling it?

EDIT: In the meantime I might just go back to my old way of managing batches manually with a database lock. I still use Sidekiq, but I track the total and pending counts in the database, and use a database lock to decrement the pending count. Then when it reaches zero I trigger the on_success callback. But I was trying to get away from that, so hopefully this can be fixed.

I'm running in a similar problem, where a relatively complex batch with children batches is calling on_success as soon as its first child completes.

This is a serious bug that prevents this gem to be reliable for any serious project.

nglx commented

@phildionne any chance that you could test if #26 is fixing your issue or not?

@managr just tried and it has indeed worked successfully! how far are you from merging #26 ? let me know if you'd like me to do other tests.

nglx commented

@phildionne let me ping @jbrady42, he mentioned that he'll be fixing the PR, I don't really want to do another PR to his one (or grabbing his code).

Any idea when the PR will be released?

Is this released?

Stale issue message