insitro/redun

Cleaning up Batch jobs on unexpected termination


If a keyboard interrupt halts an in-progress redun execution, any in-flight AWS Batch jobs will keep on running after redun has exited. Ideally those jobs would be cancelled before redun shuts down.

Perhaps relatedly, on keyboard interrupt, redun prints "Shutting down... Ctrl+C again to force shutdown." and then hangs instead of exiting. I'm not sure what in my configuration is causing this hang.

Would adding this cleanup require changes to the scheduler and executor interfaces? Perhaps a new executor hook could be added that is invoked when a job is rejected? Though I'm unclear on whether redun actually tries to reject jobs on keyboard interrupt.
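To make the idea concrete, here is a rough sketch of the kind of cleanup hook I have in mind. It is not based on redun's actual scheduler or executor interfaces: `InFlightJobs`, `cancel_on_interrupt`, and `cancel_fn` are all hypothetical names, and `cancel_fn` stands in for whatever call would really cancel a job (for example, a wrapper around an AWS Batch terminate request).

```python
from contextlib import contextmanager


class InFlightJobs:
    """Tracks IDs of submitted jobs so they can be cancelled on shutdown.

    Hypothetical helper for illustration; not part of redun.
    """

    def __init__(self, cancel_fn):
        # cancel_fn: any callable that takes a single job ID and cancels it.
        self._cancel_fn = cancel_fn
        self._job_ids = set()

    def add(self, job_id):
        """Record a job as in flight."""
        self._job_ids.add(job_id)

    def done(self, job_id):
        """Forget a job that finished on its own."""
        self._job_ids.discard(job_id)

    def cancel_all(self):
        """Cancel every job still in flight and return the cancelled IDs."""
        cancelled = []
        for job_id in sorted(self._job_ids):
            self._cancel_fn(job_id)
            cancelled.append(job_id)
        self._job_ids.clear()
        return cancelled


@contextmanager
def cancel_on_interrupt(jobs):
    """Cancel all tracked jobs if the block is interrupted, then re-raise."""
    try:
        yield jobs
    except KeyboardInterrupt:
        jobs.cancel_all()
        raise
```

The interrupt is re-raised after cleanup so the caller's normal shutdown path still runs; only the job cancellation is added in front of it.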

Cheers!
Dan Spitz

Thanks @spitz-dan-l for posting this. The current behavior is opinionated. As you state, if the scheduler is killed (e.g. with Ctrl+C) the AWS Batch jobs do continue until completion. If you start the scheduler again, redun will attempt to reunite with the jobs or their final outputs in S3.

If you really want to kill all AWS Batch jobs, there is a lower-level command to do that:

redun aws kill-jobs

There are also a few filters to kill a subset of jobs (e.g. by status).
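If you would rather script this yourself, something along these lines should work with boto3. This is only a sketch and not how `redun aws kill-jobs` is implemented: `kill_jobs`, its parameters, and the default reason string are made up for illustration, and `client` is assumed to be a boto3 Batch client (e.g. `boto3.client("batch")`). The only AWS calls used are `list_jobs` and `terminate_job`.

```python
def kill_jobs(client, job_queue, statuses=("RUNNING",), reason="Killed by user"):
    """Terminate every job in job_queue whose status is in statuses.

    Illustrative helper, not redun code. `client` is assumed to be a boto3
    Batch client. Returns the list of job IDs that were terminated.
    """
    job_ids = []
    for status in statuses:
        kwargs = {"jobQueue": job_queue, "jobStatus": status}
        # list_jobs is paginated: keep following nextToken until it runs out.
        while True:
            page = client.list_jobs(**kwargs)
            job_ids.extend(s["jobId"] for s in page.get("jobSummaryList", []))
            token = page.get("nextToken")
            if not token:
                break
            kwargs["nextToken"] = token

    # terminate_job cancels jobs that have not started yet and stops
    # jobs that are already STARTING or RUNNING.
    for job_id in job_ids:
        client.terminate_job(jobId=job_id, reason=reason)
    return job_ids
```

For example, `kill_jobs(boto3.client("batch"), "my-queue", statuses=("RUNNABLE", "RUNNING"))` would target both queued and running jobs, where `my-queue` is a placeholder queue name.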

There are some plans to add job canceling / killing as a mechanism. It could be used during Ctrl+C if that's desired. It could also be used when one Job fails and there is no catch(), so that all sibling jobs are auto-canceled. We are thinking through how users could specify those different behaviors. Any ideas you have are welcome!