tobymao/saq

Jobs stuck in active state if the saq process is killed

tiejunhu opened this issue · 0 comments

When the saq process doesn't exit cleanly, the jobs that were active at the time stay stuck in the active state and are never retried after saq restarts.

I don't think the heartbeat property is designed for this scenario: the sweep aborts a job when its heartbeat times out. In this case, though, the job should be retried rather than aborted.

I suggest that each job record its worker ID. If the sweep finds that the worker is no longer available, the job should be retried.
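
A rough sketch of the idea, not saq's actual code. The `Job` class, its `worker_id` field, the `sweep_orphans` function, and the `live_workers` set are all hypothetical names made up for illustration; how saq would track live workers is left open here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Job:
    """Hypothetical stand-in for a queued job, not saq's Job class."""
    key: str
    status: str = "queued"
    worker_id: Optional[str] = None  # proposed new field: who picked the job up
    attempts: int = 0
    retries: int = 1


def sweep_orphans(jobs: Iterable[Job], live_workers: Set[str]) -> List[Job]:
    """Requeue active jobs whose recorded worker is no longer alive.

    Jobs that have used up their retries are marked failed instead.
    An active job with no recorded worker is treated as orphaned too.
    Returns the jobs whose status was changed.
    """
    changed = []
    for job in jobs:
        if job.status != "active" or job.worker_id in live_workers:
            continue  # not running, or its worker is still around
        job.worker_id = None
        job.status = "queued" if job.attempts < job.retries else "failed"
        changed.append(job)
    return changed


# Worker "w2" was killed; only "w1" is still registered.
jobs = [
    Job("a", status="active", worker_id="w1", attempts=1, retries=3),
    Job("b", status="active", worker_id="w2", attempts=1, retries=3),
    Job("c", status="active", worker_id="w2", attempts=3, retries=3),
]
changed = sweep_orphans(jobs, live_workers={"w1"})
```

With this, job `a` is left alone, job `b` goes back to the queue, and job `c` is failed because it has no retries left. The heartbeat timeout could keep its current abort behavior for workers that are alive but hung.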