Job stuck in active state if the saq process is killed
tiejunhu opened this issue · 0 comments
When the saq process doesn't exit cleanly, the jobs that were active at that moment stay stuck in the active state and are never retried after saq restarts.
I believe the heartbeat property is not designed for this scenario: on a heartbeat timeout, the sweep aborts the job. In this scenario, though, the job should be retried.
I suggest that each job record its worker ID. If the sweep finds that the worker is no longer available, the job should be retried.
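To make the suggestion concrete, here is a minimal sketch of the sweep logic I have in mind. It is not saq's actual code: the `Job` dataclass, its `worker_id` field, the `live_workers` set, and the `sweep` function below are all hypothetical names used only for illustration, and how the set of live workers would be obtained is left open.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Job:
    """Hypothetical stand-in for a queued job; not saq's Job class."""
    key: str
    status: str = "queued"            # "queued", "active", or "failed"
    worker_id: Optional[str] = None   # assumed new field: worker that picked up the job
    attempts: int = 0
    retries: int = 3


def sweep(jobs: Iterable[Job], live_workers: Set[str]) -> List[str]:
    """Requeue active jobs whose worker is gone; return the requeued keys."""
    requeued = []
    for job in jobs:
        if job.status != "active":
            continue  # only active jobs can be orphaned
        if job.worker_id in live_workers:
            continue  # worker is still alive, leave the job alone
        if job.attempts < job.retries:
            # The worker died, so put the job back in the queue for a retry.
            job.status = "queued"
            job.worker_id = None
            requeued.append(job.key)
        else:
            # No retries left, so give up on the job.
            job.status = "failed"
    return requeued
```

With this shape, a job whose worker is still alive is left untouched, so the existing heartbeat handling could keep covering the case where a live worker hangs.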