By Guillaume Perrault-Archambault
This repository is a work in progress. Currently, the job launching script (`mini_regression.sh`) and its descendant scripts (`simulation.sh`, `slurm.sh`, `simulation.sbatch`) are ready to be beta tested by users.
The regression monitoring script (`regression_status.sh`) and the result processing script (`process_results.sh`) are not yet ready.
The scripts have mainly been tested on Compute Canada clusters, and mostly on GPU nodes.
Please open an issue if you find a bug or notice that the toolkit does not behave as intended.
This toolkit provides an automated command-line workflow for launching batches of SLURM jobs (`mini_regression.sh`), monitoring these regressions (`regression_status.sh`), and post-processing regression logs to summarize results (`process_results.sh`).
The toolkit is designed to work as-is, without modification by the user. That said, it is designed in a modular way, so that job-specific configurations can be overridden (by supplying your own `SLURM_SIMULATION_TOOLKIT_JOB_RC_PATH`), and the `get_local_cluster.sh` script can be overridden (by supplying your own `SLURM_SIMULATION_TOOLKIT_GET_CLUSTER`) if you wish to add support for a new or unsupported cluster.
The scripts were originally designed and tested with bash 4.3.48 and SLURM 17.11.12. These and newer versions of bash and SLURM are supported.
Older versions of bash/SLURM will likely work, but are not officially supported.
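As a sketch of the `SLURM_SIMULATION_TOOLKIT_GET_CLUSTER` override mechanism, a custom cluster-detection script might look like the following. The hostname prefixes and the `cluster_from_hostname` helper are illustrative assumptions, not the toolkit's actual detection logic:

```shell
#!/usr/bin/env bash
# Hypothetical replacement for get_local_cluster.sh (pointed to by
# SLURM_SIMULATION_TOOLKIT_GET_CLUSTER). The hostname prefixes below are
# assumptions for illustration; adapt them to your own cluster.
cluster_from_hostname() {
    case "$1" in
        gra*)    echo "graham" ;;
        cedar*)  echo "cedar" ;;
        beluga*) echo "beluga" ;;
        nia*)    echo "niagara" ;;
        *)       echo "unknown" ;;
    esac
}

# Print the name of the cluster this shell is running on.
cluster_from_hostname "$(hostname -s)"
```

The only contract such a script must honor is printing a single cluster name on stdout; everything else about its internals is up to you.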
- Parallel job launching: `mini_regression.sh` can launch hundreds of jobs in parallel within seconds.
- Sandboxed simulations: each simulation runs in its own autogenerated directory, so simulations do not conflict with each other (e.g. scripts may write to a file with the same name in their respective output directories).
- Source code snapshotting: the user's source code directory is copied to the autogenerated output directory, and it is this copy which is executed. This flow ensures that users can continue editing their source code without affecting pending and running jobs.
- Regression monitoring: `regression_status.sh` automatically reports the status of a regression (running, completed, failed) with a breakdown of each job.
- Reproducible simulations: snapshotting further ensures that simulations are fully reproducible, since all source code is captured at regression launch time. The regression command, SLURM commands, and simulation output are all logged, allowing the user to retrieve any arguments and parameters used in a given simulation.
- Argument cascading: arguments following `--` are cascaded down to the user's base script, so the user does not need to modify the toolkit itself to pass down arguments.
- Automatic generation of a regression cancellation script: this autogenerated script kills the appropriate SLURM jobs if and when the user decides to cancel their regression. This saves the user time tracking down running jobs to cancel, and helps free up compute resources for other users.
- Option to enforce a maximum number of jobs in parallel for the current user. This is useful for SLURM systems that don't use a fairshare system (e.g. during the beta testing phase of a new cluster).
- Option to run multiple simulations per GPU: this helps maximize use of compute resources when GPU memory exceeds the model's needs (e.g. I found that ResNet with batch size 128 uses fewer GPU hours when running 2 processes per GPU on a 32 GB GPU).
- Configurability: users can override default job parameters by supplying their own defaults, and use their own `get_local_cluster.sh`. See `SLURM_SIMULATION_TOOLKIT_JOB_RC_PATH` and `SLURM_SIMULATION_TOOLKIT_GET_CLUSTER` in the install instructions.
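To illustrate argument cascading from the base script's side, here is a hypothetical base script (the file `SLURM_SIMULATION_BASE_SCRIPT` points to). The `run_training` function and its option names are invented for this sketch; everything after `--` on the `mini_regression.sh` command line arrives here as ordinary positional arguments:

```shell
#!/usr/bin/env bash
# Hypothetical base script. Option names (--epochs, --batch_size) are
# illustrative; the toolkit passes whatever follows `--` straight through.
run_training() {
    local epochs=10 batch_size=32        # illustrative defaults
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --epochs)     epochs="$2";     shift 2 ;;
            --batch_size) batch_size="$2"; shift 2 ;;
            *)            shift ;;         # ignore anything unrecognized
        esac
    done
    echo "epochs=$epochs batch_size=$batch_size"
}

run_training "$@"
```

Because the toolkit only forwards arguments, any option-parsing convention your script already uses (getopts, argparse in Python, etc.) works unchanged.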
The following clusters are currently supported:
- Graham
- Cedar
- Beluga
- Niagara
- Beihang Dell cluster (referred to as "Beihang" in the code)
git clone https://github.com/gobbedy/slurm_simulation_toolkit <PATH_TO_TOOLKIT>
Every time you open a new shell, set and export the following environment variables:
- `SLURM_SIMULATION_TOOLKIT_HOME` should be set to `<PATH_TO_TOOLKIT>`.
- `SLURM_SIMULATION_BASE_SCRIPT` is the path to the base script that the user wishes to execute on a SLURM compute node.
- `SLURM_SIMULATION_TOOLKIT_REGRESS_DIR` is the base directory beneath which simulation output directories and regression summary directories will be autogenerated. The default value in `user_template.rc` is likely correct for most Compute Canada users.
- `SLURM_SIMULATION_TOOLKIT_JOB_RC_PATH` is the path to an RC file containing default SLURM job parameters. Since these parameters can be overridden from the command line, creating your own RC file is not strictly required.
- `SLURM_SIMULATION_TOOLKIT_GET_CLUSTER` points to a script that outputs the name of the local cluster. The default value in `user_template.rc` is likely correct for Compute Canada users.
- `SLURM_SIMULATION_TOOLKIT_SBATCH_SCRIPT_PATH` is the path to the .sbatch file passed to the sbatch command. This file wraps the user's base script. The default script pointed to in `user_template.rc` is intended to be correct for most users, but may not fit all usage models.
You may set these variables by sourcing an RC file in your shell.
A template RC file setting all the above variables can be found here: `<PATH_TO_TOOLKIT>/user_template.rc`
You may copy `<PATH_TO_TOOLKIT>/user_template.rc` to any desired location `<DESIRED_PATH_TO_RC>` and modify its contents as desired.
Then simply run:
source <DESIRED_PATH_TO_RC>
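As a sketch, an RC file adapted from `user_template.rc` might look like the following. Every path below is a placeholder to replace with your own; in particular, `my_project/train.sh` and `job_template.rc` are invented names for illustration:

```shell
# Hypothetical RC file contents; all paths are placeholders to adapt.
export SLURM_SIMULATION_TOOLKIT_HOME="$HOME/slurm_simulation_toolkit"
export SLURM_SIMULATION_BASE_SCRIPT="$HOME/my_project/train.sh"
export SLURM_SIMULATION_TOOLKIT_REGRESS_DIR="$HOME/regress"
export SLURM_SIMULATION_TOOLKIT_JOB_RC_PATH="$SLURM_SIMULATION_TOOLKIT_HOME/job_template.rc"
export SLURM_SIMULATION_TOOLKIT_GET_CLUSTER="$SLURM_SIMULATION_TOOLKIT_HOME/get_local_cluster.sh"
export SLURM_SIMULATION_TOOLKIT_SBATCH_SCRIPT_PATH="$SLURM_SIMULATION_TOOLKIT_HOME/simulation.sbatch"
```

Sourcing such a file once per shell session is enough; the toolkit scripts read these variables from the environment.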
WARNING: please do NOT store large amounts of data in the parent directory of your base script (including in any of its subdirectories), since this directory will be copied to the output directory for snapshotting.
For the same reason, please do NOT set `SLURM_SIMULATION_TOOLKIT_REGRESS_DIR` to any path beneath the parent directory of your base script.
mini_regression.sh --num_simulations 12 -- --epochs 200 --batch_size 128
The above assumes that your base script (located wherever `SLURM_SIMULATION_BASE_SCRIPT` points to) accepts an `--epochs <NUM_EPOCHS>` option and a `--batch_size <BATCH_SIZE>` option.
Note that toolkit parameters (here `--num_simulations 12`) are separated from base script parameters (here `--epochs 200 --batch_size 128`) with `--`.
Sample output:
RUNNING:
mini_regression.sh --num_simulations 12 -- --epochs 200 --batch_size 128
JOB IDs FILE IN: /lustre03/project/6004260/gobbedy/regress/regression_summary/dat_Jun05_062358/job_manifest.txt
SLURM COMMANDS FILE: /lustre03/project/6004260/gobbedy/regress/regression_summary/dat_Jun05_062358/slurm_commands.txt
REGRESSION CANCELLATION SCRIPT: /lustre03/project/6004260/gobbedy/regress/regression_summary/dat_Jun05_062358/cancel_regression.sh
REGRESSION COMMAND FILE: /lustre03/project/6004260/gobbedy/regress/regression_summary/dat_Jun05_062358/regression_command.txt
SLURM LOGFILES MANIFEST: /lustre03/project/6004260/gobbedy/regress/regression_summary/dat_Jun05_062358/slurm_log_manifest.txt
SIMULATION LOGS MANIFEST: /lustre03/project/6004260/gobbedy/regress/regression_summary/dat_Jun05_062358/log_manifest.txt
HASH REFERENCE FILE: /lustre03/project/6004260/gobbedy/regress/regression_summary/dat_Jun05_062358/hash_reference.txt
HASH REFERENCE: beluga@638b8668
Run `mini_regression.sh --help` for more details on usage.
Wraps the `sbatch` SLURM command. Also supports `srun` and `salloc` in theory, but only `sbatch` is thoroughly tested.
Handles low-level SLURM switches and parameters that do not need to be exposed to the user.
Run `slurm.sh --help` for usage.
Wraps `slurm.sh`. Handles generating the simulation output directory and copying source code to it. The simulation is run from within the output directory.
Run `simulation.sh --help` for usage.
Wraps `simulation.sh`. Handles launching multiple simulations in parallel. Generates a regression summary directory containing a job ID manifest, logfile manifest, SLURM logfile manifest, SLURM commands, the regression command, a hash reference file, and a hash reference.
Run `mini_regression.sh --help` for usage.