adjtomo/seisflows

Missing resume function

evcano opened this issue · 6 comments

Hi @bch0w,

I noticed that the resume function is missing from seisflows.py on the developer branch, although it is included in the SeisFlows 2.0.0 release. Is there a reason for this?

bch0w commented

Hi @evcano, in the transition from v2.0.0 to v2.1.0 (#125), I overhauled the submit and resume system of SeisFlows, such that the newest version only requires a single submit function (a corresponding resume function would effectively do the same thing).

I now recognize that this is an API-breaking change, so I should have changed the version number to v3.0.0 rather than v2.1.0 (looking back, I actually wrote this in the release notes of v2.1.0: https://github.com/adjtomo/seisflows/releases).

Do you think a resume function would be useful? I have had others comment that running submit to resume a workflow feels less intuitive. I could imagine a resume function having some more intricate state checking and log messages telling the User where they are resuming from. This is likely related to #144.

evcano commented

@bch0w Thanks for the detailed answer!

I do not think a resume function is necessary if submit does the same thing. Adding the --RESUME_FROM and --STOP_AFTER arguments to submit could make the function more intuitive.

I saw #144 and I am interested in helping with it. One functionality I would like is the ability to revert to a previous workflow task and resume from there. For instance, say the inversion fails at PERFORM_LINE_SEARCH due to wrong smoothing of the gradient. The user then modifies the smoothing parameters, calls revert --POST_PROCESS_EVENT_KERNELS, and resumes the inversion using submit --RESUME_FROM POST_PROCESS_EVENT_KERNELS. Nonetheless, I understand this functionality can complicate things.

bch0w commented

Agreed! Currently submit has --STOP_AFTER but not --RESUME_FROM, which would be useful if a User wanted to force the workflow to re-do tasks that had already been completed.
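To make the combined behavior concrete, here is a minimal sketch of how a pair of resume-from and stop-after options could select a slice of an ordered task list. This is not the SeisFlows implementation: `select_tasks` and the first two task names are hypothetical, and only POST_PROCESS_EVENT_KERNELS and PERFORM_LINE_SEARCH are taken from this thread.

```python
# Hypothetical ordered task list; the real workflow's tasks may differ.
TASKS = [
    "EVALUATE_INITIAL_MISFIT",    # assumed name, for illustration only
    "RUN_ADJOINT_SIMULATIONS",    # assumed name, for illustration only
    "POST_PROCESS_EVENT_KERNELS",
    "PERFORM_LINE_SEARCH",
]


def select_tasks(tasks, resume_from=None, stop_after=None):
    """Return the tasks from `resume_from` through `stop_after`, inclusive.

    Either bound may be None, meaning "from the first task" or
    "through the last task". Unknown task names raise ValueError.
    """
    start = tasks.index(resume_from) if resume_from else 0
    stop = tasks.index(stop_after) + 1 if stop_after else len(tasks)
    if start >= stop:
        raise ValueError("resume_from must not come after stop_after")
    return tasks[start:stop]


# Re-do the kernel post-processing and everything after it.
to_run = select_tasks(TASKS, resume_from="POST_PROCESS_EVENT_KERNELS")
```

Selecting by name rather than by index keeps the command line readable, at the cost of requiring task names to be unique within a workflow.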

A revert function seems useful if it could undo changes caused by other tasks. In the past I have just used resume_from to start the workflow at a previous task and let it overwrite files that may have been created. This could cause issues during the line search though, where lots of files are updated at each step count.

I started addressing #144 in this branch but have been away from it since December so I'm not sure what progress I had made. Happy to have your help on improving this system to get things working smoother!

I think the biggest task in #144 that I wanted to complete was to checkpoint individual simulations. An example of this is if simulations 1-3 complete and simulation 4 fails, restarting the simulation task would require running 1-4 again. It would be great to find a way to checkpoint that 1-3 completed successfully and only 4 needs to be re-run. I thought about using the SPECFEM output_solver.txt file but perhaps there is a cleaner way to do that.
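One way to picture per-simulation checkpointing, independent of any SPECFEM output, is a marker file written into each simulation directory only after that simulation succeeds. The sketch below is an assumption for illustration: `run_pending`, the `DONE` marker name, and the directory layout are all hypothetical and not part of SeisFlows.

```python
from pathlib import Path


def run_pending(sim_dirs, run_simulation, marker="DONE"):
    """Run only the simulations whose directory lacks the marker file.

    `run_simulation` is any callable taking a directory path; it is
    expected to raise an exception on failure, so the marker is only
    written for simulations that finished cleanly.
    Returns the names of the directories that were actually run.
    """
    ran = []
    for sim_dir in map(Path, sim_dirs):
        done_file = sim_dir / marker
        if done_file.exists():
            continue  # completed in an earlier attempt, skip it
        run_simulation(sim_dir)  # an exception here leaves no marker behind
        done_file.touch()
        ran.append(sim_dir.name)
    return ran
```

With simulations 1-3 already marked and 4 unmarked, a restart would call `run_simulation` for simulation 4 only. Forcing a full re-run then amounts to deleting the marker files.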

evcano commented

Hi @bch0w, I will gladly help with the checkpoint branch. I understand what you mean about wasting resources by re-running completed simulations. I addressed that problem in SeisFlows2 by adding a RESUME_FROM_TASK parameter to the parameter file. It set the initial task number (and therefore the initial source/event number) of the job array submitted to SLURM (i.e., --array=RESUME_FROM_TASK-NTASKS). I will try to find a cleaner way to do this.

bch0w commented

Thanks @evcano! Glad to have your help on this.

One issue I see with the RESUME_FROM_TASK approach is if simulation jobs are run in parallel, and they fail out of order, e.g., running 1-4 and 2 fails but 1, 3 and 4 succeed, then resuming from 2 will end up rerunning 3 and 4 unnecessarily.
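Since the SLURM `--array` option also accepts comma-separated lists and ranges of indices, one possible way around the out-of-order problem is to resubmit only the failed task IDs. The helper below is a hypothetical sketch (`array_spec` is not a SeisFlows function) that compresses a set of task IDs into such a specification string.

```python
def array_spec(task_ids):
    """Compress task IDs into an array specification such as '2,5-7'.

    Duplicates are ignored and the input does not need to be sorted.
    An empty input gives an empty string.
    """
    ids = sorted(set(task_ids))
    if not ids:
        return ""
    ranges = []
    start = prev = ids[0]
    for i in ids[1:]:
        if i == prev + 1:
            prev = i  # extend the current consecutive run
            continue
        ranges.append((start, prev))
        start = prev = i
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Tasks 1-4 were submitted and only task 2 failed.
failed = [2]
spec = array_spec(failed)  # would be passed as --array=<spec>
```

In the 1-4 example, this resubmits task 2 alone instead of the range 2-4.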

Another possibility is that once a submitted task succeeds, it writes its job ID or event ID to the state file. That would allow us to keep track of which specific events need to be re-run. That approach is a bit fragile, though, because if the state file gets deleted or corrupted in any way, that information is likely not recoverable.
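As a rough sketch of the state-file idea, the snippet below records completed event IDs in a JSON file and replaces the file atomically, which reduces (but does not remove) the risk of a half-written file. Everything here is an assumption for illustration: the JSON layout and the names `load_completed` and `mark_completed` are hypothetical, and a missing or unreadable file is simply treated as "nothing completed". The read-modify-write step is not safe if several jobs update the file at the same moment, so a real version would need locking or a single writer.

```python
import json
import os
import tempfile


def load_completed(path):
    """Return the set of completed event IDs, or an empty set if the
    state file is missing or cannot be parsed."""
    try:
        with open(path) as f:
            return set(json.load(f)["completed"])
    except (FileNotFoundError, json.JSONDecodeError, KeyError, TypeError):
        return set()  # fall back to re-running everything


def mark_completed(path, event_id):
    """Add `event_id` to the state file using write-then-rename."""
    completed = load_completed(path)
    completed.add(event_id)
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"completed": sorted(completed)}, f)
    os.replace(tmp_path, path)  # atomic replacement on the same filesystem


def events_to_rerun(path, all_events):
    """Return the events not yet recorded as completed, in input order."""
    completed = load_completed(path)
    return [e for e in all_events if e not in completed]
```

If the state file is lost, `events_to_rerun` returns every event, so the failure mode is wasted compute rather than silently skipped simulations.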