NOAA-GFDL/MDTF-diagnostics

Request Slurm RuntimeManager


What problem will this feature solve?
The framework currently runs PODs in separate subprocesses on a single processor, relying on the OS's scheduling to share compute and memory. This is inadequate for analyzing larger volumes of data, e.g. the output of current high-resolution GFDL CM4/MOM6 runs. Functionality is needed to scale multi-POD execution beyond the limits of a single node.

This request is not especially urgent, but does reflect a real-world use case. The workaround currently suggested for GFDL is to submit a batch job for each POD as a separate run of the framework, and to collect and reorganize the output manually.

Describe the solution you'd like
This issue proposes implementing a RuntimeManager that submits each POD to the Slurm scheduler as a separate batch job. As with the current SubprocessRuntimeManager, the framework's execution would block until all jobs complete or return errors, with status logged, and each POD's output would be written to a separate directory within the run's overall MDTF_* output directory. A rough sketch of what this could look like is given below.
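The following is a minimal, hypothetical sketch only; the class name `SlurmRuntimeManager` and the POD attributes (`pod.name`, `pod.run_script`, `pod.output_dir`) are placeholders and do not correspond to the framework's actual plug-in interface. It submits one batch job per POD with `sbatch --parsable` and polls `sacct` until every job reaches a terminal state.

```python
import subprocess
import time


class SlurmRuntimeManager:
    """Sketch: submit each POD as a separate Slurm batch job and block until all finish."""

    def __init__(self, pods, extra_sbatch_args=None):
        self.pods = pods
        self.extra_sbatch_args = extra_sbatch_args or []  # user-supplied sbatch directives
        self.jobs = {}  # maps Slurm job ID -> POD object

    def submit_all(self):
        for pod in self.pods:
            cmd = [
                "sbatch", "--parsable",
                f"--job-name=MDTF_{pod.name}",
                f"--chdir={pod.output_dir}",            # set by the framework
                f"--output={pod.output_dir}/slurm.out",  # set by the framework
                f"--error={pod.output_dir}/slurm.err",
                *self.extra_sbatch_args,
                pod.run_script,
            ]
            # --parsable prints "<jobid>" or "<jobid>;<cluster>"
            job_id = subprocess.run(
                cmd, check=True, capture_output=True, text=True
            ).stdout.strip().split(";")[0]
            self.jobs[job_id] = pod

    def wait_all(self, poll_interval=30):
        """Poll sacct until every submitted job reaches a terminal state, logging the result."""
        pending = dict(self.jobs)
        while pending:
            time.sleep(poll_interval)
            for job_id in list(pending):
                state = subprocess.run(
                    ["sacct", "-j", job_id, "-n", "-X", "-o", "State"],
                    capture_output=True, text=True,
                ).stdout.strip()
                if state and state not in ("PENDING", "RUNNING"):
                    print(f"POD {pending[job_id].name}: job {job_id} finished with state {state}")
                    del pending[job_id]
```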

The user would select between this RuntimeManager and the SubprocessRuntimeManager via the --runtime_manager setting, using the existing plug-in mechanism. CLI options specific to this RuntimeManager should let the user pass arbitrary directives (e.g. requested run time) through to sbatch, although some of them (working directory, paths to stdout/stderr) should be set by the framework, as illustrated in the sketch below.
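As an illustration of the directive pass-through (the helper name and the shape of the user-supplied list are assumptions, not existing framework options), the framework could filter out user directives that collide with the options it reserves and append its own:

```python
def build_sbatch_args(user_directives, work_dir, log_dir):
    """Combine user-supplied sbatch directives with the ones the framework
    must control (working directory, stdout/stderr paths)."""
    reserved = ("--chdir", "--output", "--error")
    # drop any user directive that collides with a framework-reserved option
    passed_through = [d for d in user_directives if not d.startswith(reserved)]
    return passed_through + [
        f"--chdir={work_dir}",
        f"--output={log_dir}/slurm.out",
        f"--error={log_dir}/slurm.err",
    ]


# e.g. a user requesting a two-hour wall-clock limit on a hypothetical partition:
# build_sbatch_args(["--time=02:00:00", "--partition=analysis"], work_dir, log_dir)
```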

Aspects of the implementation would necessarily be site-specific, e.g. to allow for use of different file transfer protocols between nodes of the cluster, and to make use of shared filesystems, if any.

Another site-specific detail would be whether the pre-processed model input data could be placed on a filesystem that's mounted on all the nodes, or whether the input data for each POD would need to be transferred to the node responsible for running it. This latter scenario is more general, but more complicated, as it requires communication between the POD batch job, the scheduler, and the framework's process.
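For the no-shared-filesystem case, one option is for the framework to generate a batch script that stages the input data onto the compute node before invoking the POD. The sketch below is hypothetical and assumes rsync and a node-local scratch directory; the actual transfer protocol, scratch location, and the DATADIR environment variable are site- and implementation-specific assumptions.

```python
import textwrap


def make_pod_job_script(pod_name, data_source, pod_command, scratch_root="/local/scratch"):
    """Sketch: generate a batch script that stages pre-processed model data onto
    the compute node, runs the POD against the local copy, and cleans up."""
    return textwrap.dedent(f"""\
        #!/bin/bash
        set -e
        STAGE_DIR={scratch_root}/mdtf_{pod_name}_$SLURM_JOB_ID
        mkdir -p "$STAGE_DIR"
        # stage pre-processed model data onto this node (site-specific step)
        rsync -a {data_source}/ "$STAGE_DIR/"
        # run the POD against the node-local copy
        DATADIR="$STAGE_DIR" {pod_command}
        rm -rf "$STAGE_DIR"
    """)
```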

Describe alternatives you've considered
N/A

Additional context