adjtomo/seisflows

failsafe: re-submit workflow on main job walltime error

bch0w opened this issue · 0 comments

TL;DR: It would be nice to have a failsafe where the main job re-submits itself to the System when it is approaching walltime.

I have consistently run into an issue where my main SeisFlows job, which usually sits on a cluster node, hits walltime before the workflow has completed. Usually this happens because run tasks (submitted to the cluster) get stuck in the queue, leaving the main job sitting and waiting until it hits walltime.

Walltime is usually hard set by the cluster (e.g., 24 hr), so this cannot be remedied by requesting a longer walltime. One workaround is to run the main job directly on the login node, but at the moment the code is not optimized well enough, and some large array manipulations take place in the main job, which cluster sysadmins may not like.

One possible alternative is to have the main job monitor its own elapsed walltime and, if it gets too close to the limit while still running through the workflow, submit a new main job in its place as a sort of 'failsafe'. The new job would need to receive the current position in the workflow and the jobs currently sitting in the queue. This would likely be closely tied to the checkpointing system (#144) that is in the works.
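
To make the idea concrete, here is a rough sketch of what the failsafe could look like. None of this is existing SeisFlows code: `WalltimeFailsafe`, `run_workflow`, the `resubmit` callback, and the `state` dictionary layout are all hypothetical names, and the real version would presumably hook into the System module and the checkpointing work rather than a plain dict.

```python
import time


class WalltimeFailsafe:
    """Hypothetical helper: tracks time left before the main job hits walltime."""

    def __init__(self, walltime_s, buffer_s, clock=time.monotonic):
        self.walltime_s = walltime_s  # hard limit set by the cluster, in seconds
        self.buffer_s = buffer_s      # safety margin needed to hand off cleanly
        self._clock = clock           # injectable so the logic can be tested
        self._start = clock()

    def remaining(self):
        """Seconds left before the walltime limit is reached."""
        return self.walltime_s - (self._clock() - self._start)

    def should_resubmit(self):
        """True once the remaining time has dropped into the safety margin."""
        return self.remaining() <= self.buffer_s


def run_workflow(tasks, failsafe, resubmit, state=None):
    """Run tasks in order, handing off to a new main job if walltime is near.

    `state` is an assumed checkpoint layout holding the workflow position and
    the IDs of jobs still in the queue. Returns True if the workflow finished,
    False if a replacement main job was requested through `resubmit`.
    """
    if state is None:
        state = {"next_task": 0, "queued_job_ids": []}

    for idx in range(state["next_task"], len(tasks)):
        if failsafe.should_resubmit():
            state["next_task"] = idx  # the new main job resumes from here
            resubmit(state)           # e.g. write a checkpoint, then submit
            return False
        tasks[idx](state)

    state["next_task"] = len(tasks)
    return True
```

In this sketch the check only happens between tasks, so the buffer would have to be longer than the longest single task. A main job that spends most of its time waiting on the queue would also need the same check inside its wait loop.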