adjtomo/seisflows

improve and optimize checkpointing/state system

bch0w opened this issue · 5 comments

bch0w commented

The checkpointing system is pretty rudimentary at the moment: it simply checks whether a function in workflow.task_list has already been run (by name) and skips the function if it has. This is not optimal because a function that only partially completed is re-run in full, which means a lot of repeated simulations. E.g., if 60 forward simulations are launched and all but one finish, restarting the workflow will cause all 60 jobs to be re-run.
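To make the idea concrete, here is a minimal sketch of what task-level checkpointing could look like. Everything in it is an assumption for illustration: `run_with_checkpoint`, the JSON file of completed task IDs, and the serial loop are hypothetical and not part of the current code base, where the jobs would be submitted in parallel through the system module.

```python
import json
from pathlib import Path


def run_with_checkpoint(task_ids, run_task, state_path):
    """Run `run_task(task_id)` for every ID not already marked complete.

    Hypothetical helper (not an existing SeisFlows function): completed
    IDs are recorded in a JSON file, so a restart only re-runs the tasks
    that did not finish.
    """
    state_path = Path(state_path)
    if state_path.exists():
        done = set(json.loads(state_path.read_text()))
    else:
        done = set()

    executed = []
    for task_id in task_ids:
        if task_id in done:
            continue  # finished in a previous run, skip it
        run_task(task_id)  # an exception here leaves the task unmarked
        done.add(task_id)
        executed.append(task_id)
        # Persist after every task so a crash loses at most one result
        state_path.write_text(json.dumps(sorted(done)))
    return executed
```

With something like this, the 60-simulation example above would re-run only the one job that failed. The same bookkeeping could instead be derived from solver logs or from the waveforms already written to disk, rather than from a separate file.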

Additionally, there is no clear way to interact with the state file. At the moment it involves directly editing the text file, which is clunky and prone to error. It would be great to have a command line tool to check, edit and clear the state file. The state file should likely also be a hidden file to prevent casual users from accidentally deleting or editing it manually.
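As a rough sketch of the check/edit/clear operations such a tool could expose: the `StateFile` class, the `.state` file name, and the `name: status` line format below are all hypothetical, chosen only for illustration, and do not describe the current state file.

```python
from pathlib import Path


class StateFile:
    """Hypothetical wrapper around a hidden, line-based state file."""

    def __init__(self, workdir=".", name=".state"):
        # A leading dot hides the file from a plain `ls` on Unix-like systems
        self.path = Path(workdir) / name

    def check(self):
        """Return the stored {task_name: status} mapping (empty if no file)."""
        if not self.path.exists():
            return {}
        state = {}
        for line in self.path.read_text().splitlines():
            if line.strip():
                key, _, value = line.partition(":")
                state[key.strip()] = value.strip()
        return state

    def edit(self, task_name, status):
        """Set the status of one task and rewrite the file."""
        state = self.check()
        state[task_name] = status
        self.path.write_text("".join(f"{k}: {v}\n" for k, v in state.items()))

    def clear(self):
        """Delete the state file if it exists."""
        self.path.unlink(missing_ok=True)  # missing_ok needs Python 3.8+
```

A command line entry point could then map subcommands onto these three methods, so users never need to open the file by hand.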

Tasks to be completed are then:

  • Improve task-level checkpointing (this can be done by checking solver logs, or generated waveforms in the 'traces' dir.)
  • Create a seisflows state command line option to manipulate the state file
  • Generate the entire state file at the setup stage, as opposed to building it throughout the first iteration
  • Make the state file a hidden file

evcano commented

Hi @bch0w, I implemented a task-level checkpoint system. I have been testing it during the past months and it seems to work pretty well. I will submit a pull request this week!

bch0w commented

Hi @evcano, just wondering if you have any updates on your task-level checkpoint system? No pressure, just something I have been thinking about recently and I would be excited to see your work on it!

evcano commented

Hi @bch0w,

I am cleaning up the changes I made. I will submit the pull request tomorrow!