FLAMEGPU/FLAMEGPU2

Simulation Checkpointing

(I couldn't find an explicit issue about this).

Resuming long-running simulations (or especially ensembles, see #807) would be useful.

I.e. to support pre-emptable jobs on HPC, simulations which may hit runtime limits on some HPC systems, or to finish a mostly-complete simulation which was interrupted for some reason, such as power loss.

For simple models without device RNG this should already be achievable via input / output states, with careful handling of simulation initialisation (i.e. loading from disk; see the sketch after the list below), but more complex cases would not be correct, and may also be impacted by non-determinism within FLAME GPU 2.

  • RNG state will need to be saved to disk and resumed
    • but possibly overridden with fresh state for simulations which run from a "warmed-up" state, i.e. a traffic sim which is run for a while to populate the environment prior to multiple runs with different RNG to model alternate scenarios.
  • Use of atomics will prevent this from being exactly reproducible; #417 may be needed before this is truly useful.
  • Initialisation will need careful handling in models that support this mode.
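
For that simple case, a manual checkpoint / resume cycle would look roughly like the sketch below, reusing the existing state input / output. This is illustrative only: the model is a placeholder, exportData() and SimulationConfig().input_file are assumed to behave as in the current API, and (as above) device RNG state is not carried between the two legs.

```cpp
#include "flamegpu/flamegpu.h"

int main(int argc, const char **argv) {
    // Toy model, purely for illustration.
    flamegpu::ModelDescription model("checkpoint_demo");
    auto agent = model.newAgent("point");
    agent.newVariable<float>("x");

    // First leg: start fresh (init functions populate the agents), then write
    // the full agent / environment state to disk at the end of the run.
    flamegpu::CUDASimulation sim(model, argc, argv);
    sim.SimulationConfig().steps = 1000;
    sim.simulate();
    sim.exportData("checkpoint_step_1000.json");

    // Later leg (e.g. a new job): load the exported state instead of a fresh
    // initial state. Device RNG state is not captured by this, and init
    // functions still run, which is exactly the gap described above.
    flamegpu::CUDASimulation resumed(model, argc, argv);
    resumed.SimulationConfig().input_file = "checkpoint_step_1000.json";
    resumed.SimulationConfig().steps = 1000;
    resumed.simulate();
    return 0;
}
```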

This would probably want to be an opt-in feature, as users will need to make sure their models are compatible.

It might also be worth only resuming if requested (i.e. binary.exe --resume path/to/checkpoint.json) or something to that effect, or possibly a --checkpoint flag which will run with checkpointing enabled, and resume if a matching checkpoint state is found.
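
A rough sketch of that --checkpoint behaviour from the user's side (the flag is hypothetical and the parsing is hand-rolled just for illustration; only exportData() / input_file are assumed from the existing API):

```cpp
#include <filesystem>
#include <string>

#include "flamegpu/flamegpu.h"

int main(int argc, const char **argv) {
    // Hypothetical --checkpoint <path> flag.
    std::string checkpoint_path;
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::string(argv[i]) == "--checkpoint")
            checkpoint_path = argv[i + 1];
    }

    flamegpu::ModelDescription model("checkpoint_demo");
    model.newAgent("point").newVariable<float>("x");
    flamegpu::CUDASimulation sim(model);

    // Resume if a matching checkpoint state is found, otherwise start fresh.
    if (!checkpoint_path.empty() && std::filesystem::exists(checkpoint_path)) {
        // Ideally init functions would be skipped on this path (see the note
        // on entry paths further down).
        sim.SimulationConfig().input_file = checkpoint_path;
    }
    sim.SimulationConfig().steps = 1000;
    sim.simulate();

    // Write (or refresh) the checkpoint so a later job can pick it up.
    if (!checkpoint_path.empty())
        sim.exportData(checkpoint_path);
    return 0;
}
```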

The importance of checkpointing for cloud jobs was highlighted during the N8 DRI retreat, as cheaper forms of cloud compute can be pre-empted by other jobs.

Checkpointing every N units of wall-clock time might be more useful than every N iterations, as users will (hopefully) have a better idea of how much time they are willing to lose rather than how many iterations' worth of simulation.
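
A wall-clock cadence could be driven from a host step function along these lines. FLAMEGPU_STEP_FUNCTION and getStepCounter() are existing API; writeCheckpoint() is a placeholder for whatever serialisation mechanism ends up being used (exportData() today, or a dedicated saveCheckpoint() in future).

```cpp
#include <chrono>

#include "flamegpu/flamegpu.h"

namespace {
// Sketch only: write a checkpoint once more than CHECKPOINT_INTERVAL of
// wall-clock time has passed, checked between iterations.
constexpr std::chrono::minutes CHECKPOINT_INTERVAL{30};
std::chrono::steady_clock::time_point last_checkpoint = std::chrono::steady_clock::now();

void writeCheckpoint(unsigned int step) {
    (void)step;  // Placeholder: serialise agent, environment and RNG state.
}
}  // namespace

FLAMEGPU_STEP_FUNCTION(periodic_checkpoint) {
    const auto now = std::chrono::steady_clock::now();
    if (now - last_checkpoint >= CHECKPOINT_INTERVAL) {
        // Runs between iterations, so the state written is self-consistent.
        writeCheckpoint(FLAMEGPU->getStepCounter());
        last_checkpoint = now;
    }
}
// Registered by the model author, e.g. model.addStepFunction(periodic_checkpoint);
```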

Slurm (and presumably other schedulers) has signal-based mechanisms for job termination and preemption which may also be useful for non-periodic checkpointing, but they will require some user knowledge and are Slurm-configuration specific.

sbatch --signal=[{R|B}:]<sig_num>[@sig_time] can be used to send a signal N seconds before the end of the job; the sim can then handle that signal and write out a checkpoint. The lead time would need to be large enough for the checkpointing process to complete, which will depend on the population size, how long an iteration takes (as writing mid-iteration might not be a good idea), etc.

When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have an integer value between 0 and 65535. By default, no signal is sent before the job's end time. If a sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option to signal only the batch shell, none of the other processes will be signaled. By default all job steps will be signaled, but not the batch shell itself. Use the "R:" option to allow this job to overlap with a reservation with MaxStartDelay set. To have the signal sent at preemption time see the preempt_send_user_signal SlurmctldParameter.

Jobs which get pre-empted will be issued SIGCONT and SIGTERM signals in advance of termination, if the cluster is configured with a GraceTime.
This defaults to 0 though, so it might not be useful in a lot of places, or might not be long enough.

GraceTime: Specifies a time period for a job to execute after it is selected to be preempted. This option can be specified by partition or QOS using the slurm.conf file or database respectively. This option is only honored if PreemptMode=CANCEL. The GraceTime is specified in seconds and the default value is zero, which results in no preemption delay. Once a job has been selected for preemption, its end time is set to the current time plus GraceTime. The job is immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time.
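
Either signal (e.g. USR1 from --signal, or the SIGTERM sent at the start of a GraceTime) could then be turned into a checkpoint roughly as follows. The handler only sets a flag, and the checkpoint itself is written between iterations; writeCheckpoint() is again a placeholder.

```cpp
#include <atomic>
#include <csignal>

#include "flamegpu/flamegpu.h"

namespace {
// Sketch only: the handler just sets a flag (about all that is safe inside a
// signal handler); a host step function checks it between iterations and
// writes the checkpoint, since dumping state mid-iteration would not be safe.
std::atomic<bool> checkpoint_requested{false};

void onSignal(int) {
    checkpoint_requested = true;
}

void writeCheckpoint(unsigned int step) {
    (void)step;  // Placeholder, as in the sketch above.
}
}  // namespace

FLAMEGPU_STEP_FUNCTION(checkpoint_on_signal) {
    if (checkpoint_requested.exchange(false)) {
        writeCheckpoint(FLAMEGPU->getStepCounter());
        // A real implementation would then end the run cleanly (e.g. via an
        // exit condition) before the scheduler follows up with SIGKILL.
    }
}

// Installed before simulate(), covering both sbatch --signal=USR1@<sig_time>
// and the SIGTERM sent during a preemption GraceTime:
//   std::signal(SIGUSR1, onSignal);
//   std::signal(SIGTERM, onSignal);
```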

An alternative might be a more user-involved enabling of checkpointing, requiring the user to implement a step function from which they can call a saveCheckpoint method or similar. This doesn't play as nicely with signals however (in the case of simulations with long-running steps, or a very large state to write to disk).
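
The sketches above already assume this route; the opt-in would then just be the user registering the step function on their model. addStepFunction() exists today, whereas a saveCheckpoint() host method is what this paragraph proposes.

```cpp
#include "flamegpu/flamegpu.h"

FLAMEGPU_STEP_FUNCTION(periodic_checkpoint) {
    // Body as in the wall-clock sketch above; with the proposed API this is
    // where a call such as FLAMEGPU->saveCheckpoint("...") (hypothetical)
    // would go.
}

int main() {
    flamegpu::ModelDescription model("my_model");
    model.newAgent("point").newVariable<float>("x");
    // Opting in is just registering the step function.
    model.addStepFunction(periodic_checkpoint);
    // ... rest of the model definition, CUDASimulation construction and
    // simulate() as usual ...
    return 0;
}
```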

Resuming from a saved checkpoint might require a different entry path too, to avoid the execution of init functions for instance.

This should be doable without an API break however (via overloaded / optional methods?), i.e. purely as additions.

Fujitsu raised interest in simulation checkpointing.