
slurm-tools

This repo contains useful slurm tools, including:

  • stui: a TUI for viewing the slurm job queue and navigating log files. It works with any slurm-based jobs.
  • snapshot: a tool for running jobs with code isolation when working on NFS filesystems
  • slogs: a tool for quickly viewing logs from submitit log directories
  • dashboard.py: a streamlit-based web interface for navigating submitit logs

Installation

As a User

Run:

$ pipx install slurm-tools

Development

Set up a conda Python environment and poetry:

  1. Run conda create -n tools python=3.8
  2. Run conda activate tools
  3. Run conda install poetry
  4. Run poetry install

STUI: Slurm Job Queue and Log Viewer

Snapshot Tool

This tool helps isolate experiments on NFS by:

  1. Copying the contents of the current directory to another one, keyed either randomly or using a given identifier
  2. Changing the current directory to that new directory
  3. Executing the given command in the new directory

This makes it possible to later reference the code that actually ran, even if the source-control version has since changed (e.g., you submit a long-running experiment, keep coding while it runs, and later need to reference the original code). It also prevents a failure mode on NFS where a running experiment reads newly modified code and crashes (e.g., if the job is pre-empted and then rerun at a later time).

For example, you can run:

$ snapshot --experiment-id 42 'echo "my awesome experiment"'
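
Conceptually, the tool does something like the following (a minimal Python sketch, not the actual implementation; the snapshot_root location, function names, and copy semantics are assumptions):

import os
import shutil
import subprocess
import sys
import uuid

def snapshot_and_run(command, experiment_id=None, snapshot_root="~/.snapshots"):
    # Key the snapshot directory by the given identifier, or randomly if none is given.
    key = str(experiment_id) if experiment_id is not None else uuid.uuid4().hex[:8]
    dest = os.path.join(os.path.expanduser(snapshot_root), key)
    # 1. Copy the contents of the current directory to the snapshot directory.
    shutil.copytree(os.getcwd(), dest, dirs_exist_ok=True)
    # 2. Change the current directory to that new directory.
    os.chdir(dest)
    # 3. Execute the given command in the new directory.
    return subprocess.run(command, shell=True).returncode

if __name__ == "__main__":
    sys.exit(snapshot_and_run('echo "my awesome experiment"', experiment_id=42))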

Dashboard

When running, the dashboard looks like this: [dashboard screenshot]

The dashboard works by inspecting the directory specified by SLURM_DASHBOARD_DIR for files that follow the slurm logging format IDENTIFIER_log.out and IDENTIFIER_log.err, where IDENTIFIER is the job ID, optionally with the array task ID appended. This is the default format used by submitit, which is how I submit slurm jobs, hence that choice.
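
For illustration, discovering jobs from such a directory could look roughly like this (a sketch; the dashboard's actual parsing logic may differ):

import os
import re
from collections import defaultdict

# Matches e.g. 12345_log.out (plain job) or 12345_7_log.err (array job).
LOG_RE = re.compile(r"^(?P<identifier>\d+(?:_\d+)?)_log\.(?P<stream>out|err)$")

def discover_jobs(log_dir):
    """Group the log files in log_dir by job identifier."""
    jobs = defaultdict(dict)
    for name in os.listdir(log_dir):
        match = LOG_RE.match(name)
        if match:
            jobs[match.group("identifier")][match.group("stream")] = os.path.join(log_dir, name)
    return jobs

for identifier, streams in sorted(discover_jobs(os.environ["SLURM_DASHBOARD_DIR"]).items()):
    print(identifier, sorted(streams))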

You can configure something similar for your own jobs in the sbatch submit file, like so for non-array jobs (%j expands to the job ID):

#SBATCH --output=/log_dir/%j_log.out
#SBATCH --error=/log_dir/%j_log.err

or like so for array jobs (%A and %a expand to the array job ID and the array task index):

#SBATCH --output=/log_dir/%A_%a_log.out
#SBATCH --error=/log_dir/%A_%a_log.err

The dashboard uses this to list all slurm jobs that have been run (except those that fail at launch and therefore have no log files, e.g., when the log directory doesn't exist). The dashboard does not automatically call squeue, but there is a button to load all of the user's submitted jobs with a better output format than the default. Similarly, for a given job there is a button to retrieve sacct information, but this is not run by default and has to be triggered by the user. This design avoids overloading the slurm daemon with automated commands while still making the information easy for a human to view. If jobs are continually running and you need to refresh the list of jobs, either reload the page or use streamlit's built-in rerun shortcut (r).
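
The button-triggered queries could be plain subprocess calls along these lines (a sketch; the exact squeue/sacct format strings the dashboard uses are assumptions):

import getpass
import subprocess

def load_user_jobs():
    # One squeue call for all of the current user's jobs, with an explicit output format.
    result = subprocess.run(
        ["squeue", "--user", getpass.getuser(),
         "--format", "%.18i %.9P %.30j %.8T %.10M %R"],
        capture_output=True, text=True, check=True)
    return result.stdout

def load_job_accounting(job_id):
    # One sacct call for a single job, issued only when the user asks for it.
    result = subprocess.run(
        ["sacct", "--jobs", str(job_id),
         "--format", "JobID,State,Elapsed,MaxRSS,NodeList"],
        capture_output=True, text=True, check=True)
    return result.stdout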

Run with:

$ streamlit run dashboard.py

Configure by setting the SLURM_DASHBOARD_DIR environment variable to the directory containing the log files.