TNG/momo-scheduler

Job fails with Unexpected Error after ECONNRESET


Hello,

In our validation environment, a job failed with the message "an unexpected error occurred while executing job" with payload
{ error: "read ECONNRESET", name: <redacted>, type: "executing job failed"}

Since then, the job has not been executed again, even though it is scheduled to run every 15 minutes.

Currently, I cannot provide much more information, as I first have to adjust the service's log level, but I will update this issue once I have more.

Redeployment fixed the stuck execution.

If the job is defined with maxRunning = 1 and, for some reason, the scheduler's executions field in the DB is not decremented ($inc: -1) for that particular job after a failed run, then this behaviour is actually correct.

Momo then thinks there is currently a job running and does not try to run it again, as maxRunning is reached. I was able to reproduce this.

So the real question is why momo failed to update the executions field.

And I guess a service restart fixes this, since the scheduler is re-registered with a clean executions field.
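To make this concrete, the update that apparently did not happen is essentially a decrement like this (simplified sketch only; the collection and field names are illustrative, not momo's exact schema):

```typescript
import { MongoClient } from 'mongodb';

// Simplified sketch of the bookkeeping after a job run. Collection and
// field names are illustrative, not momo's exact schema.
async function markJobFinished(mongoUrl: string, jobName: string): Promise<void> {
  const client = await MongoClient.connect(mongoUrl);
  try {
    // If this update fails (e.g. because of an ECONNRESET), executions stays
    // at 1 and a job with maxRunning = 1 is never started again.
    await client
      .db()
      .collection('schedules')
      .updateOne({ name: jobName }, { $inc: { executions: -1 } });
  } finally {
    await client.close();
  }
}
```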

Unfortunately, we've been unable to reproduce the issue. My suspicion is that for some reason the MongoDB operation itself failed, after which momo's behaviour was simply correct. Perhaps we can close this and reopen it should you experience something similar in the future? Then we can likely get more information from the logs and go from there.

I guess the question is how we can make momo more stable. If there really was a Mongo error that prevented the executions field from being updated, then momo got stuck believing that the job was running forever and never needed to be started again. How could we have recovered from that?

Would some kind of timeout for jobs be useful? The user of momo would probably have to define it as part of the job, since momo has no way of guessing what a reasonable execution time is. After the timeout we would assume the job is dead and clean up.
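Roughly the mechanism I have in mind, just as a sketch with made-up names, not anything momo implements today: the handler gets wrapped and rejected once the user-defined timeout is exceeded, after which the executions count could be cleaned up.

```typescript
// Sketch only: wrap a job handler and reject once the user-defined
// timeout is exceeded. Names and signatures are hypothetical.
async function withTimeout<T>(run: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`job timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    return await Promise.race([run(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```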

Or - if I understand correctly, the application just logged an error once and then continued to run as if nothing was wrong, right? - would it be better if momo stated more clearly that it got stuck in an error case? Spamming the log with errors so alarms have a good chance to trigger, or something like that? :D Would that have been useful?

I was thinking about letting momo build a "job profile" - e.g. how long each job usually runs - and then killing any job that deviates too much from that. Or maybe just spamming logs.

The timeout you suggest is also an option, but it involves a bit of guesswork on the user's part. Maybe we can start with adding an error log when momo tries to schedule a job but maxRunning is already reached? That's a weird case anyway, I suppose, so we should probably report it somehow.
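Something along these lines (again just a sketch with made-up names, not momo's internals):

```typescript
// Illustrative check: instead of silently skipping, log loudly when
// maxRunning is already reached. Names are hypothetical.
interface JobState {
  name: string;
  maxRunning: number;
  executions: number;
}

function shouldStart(job: JobState, logger: { error: (message: string, meta?: object) => void }): boolean {
  if (job.maxRunning > 0 && job.executions >= job.maxRunning) {
    logger.error('maxRunning reached, job not started', {
      name: job.name,
      executions: job.executions,
      maxRunning: job.maxRunning,
    });
    return false;
  }
  return true;
}
```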

We spent quite a lot of time looking into this. From our perspective, this was caused by a Mongo issue that was correctly reported in the logging.

We don't feel it's momo's responsibility to mitigate such issues. Nevertheless, we'll look into improving our logging with the next release to make debugging this easier in the future.

Thank you for looking into this :)