saurabhnanda/odd-jobs

Jobs which time out never hit the attempt limit

Opened this issue · 5 comments

Maybe this is the intended behaviour for Odd Jobs, from your requirements - apologies if so.

Because job runners only check the number of attempts on a job after the job fails i.e. throws an exception, long-running jobs which time out will be repeatedly picked up by job runners with an ever-increasing attempt count. My expectation was that jobs which have timed out and have too many attempts should be ignored by the polling logic.

This is a bug. If you plan to raise a PR for this, it might be a good idea to first start with a test-case that fails.

Because job runners only check the number of attempts on a job after the job fails i.e. throws an exception, long-running jobs which time out will be repeatedly picked up by job runners with an ever-increasing attempt count. My expectation was that jobs which have timed out and have too many attempts should be ignored by the polling logic.

I probably misread this earlier. I thought that you're trying to point out that job.attempts doesn't increase for timed-out jobs. But I just checked the code and that is not the case. job.attempts is increased in the case of any type of failure (including timeout).

Could you try explaining the problem that you're facing in the current job-polling implementation again, please?

PS: Also check if #73 is what you're looking for?

Could you try explaining the problem that you're facing in the current job-polling implementation again, please?

When a job is picked up, either because it is queued or because it is locked and timed out, the attempt count is not checked, but is incremented. The attempt count check only happens when the job fails (throws an exception). If the job instead times out, it will just be picked up again - but the attempt count will not be checked against the configured max attempts to see if the job should run again.

This means that jobs that keep timing out will have their attempt count increment with each timeout, but the attempt count will never be checked to prevent them from running and mark them as failed.

Ah - I understand now. And thanks for following this through. It's a subtle catch.

The problem is with job timeouts that are not caught by runJobWithTimeout (that function throws a TimeoutException which is cause and goes through the retryOrFail check). The problem is with jobs that remain in a locked state and are forcibly picked up the jobPollingSql.

Are you seeing a lot of these kinds of timeouts? Is runJobWithTimeout not handling this for you?

Are you seeing a lot of these kinds of timeouts? Is runJobWithTimeout not handling this for you?

It seems that this was partly due to a problem with some dependency that we were using, since recent changes to the code have meant that it no longer occurs.

I believe the problem was that some code in the job logic itself was masking async exceptions, so runJobWithTimeout would throw the timeout exception to the job thread, and it would then block forever.

That in itself doesn't seem like a bug that Odd Jobs can do anything about - but I do think the polling logic should be checking the attempts counter before trying to queue up jobs that are locked and timed out.