This repo is now archived, and the materials have been moved here.

SWC SLURM demo

  1. Log in to the gateway
ssh USERNAME@ssh.swc.ucl.ac.uk
  2. Log in to the HPC cluster
ssh hpc-gw1

N.B. You can set up SSH keys so you don't have to keep typing in your password.
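
For example, a minimal key setup on your local machine might look like the following (the ed25519 key type and the swc-gateway/swc-hpc host aliases are illustrative choices, not part of the demo):

ssh-keygen -t ed25519                    # generate a key pair (accept the defaults)
ssh-copy-id USERNAME@ssh.swc.ucl.ac.uk   # install the public key on the gateway

You can also add something like this to ~/.ssh/config so that a single ssh swc-hpc hops through the gateway automatically:

Host swc-gateway
    HostName ssh.swc.ucl.ac.uk
    User USERNAME

Host swc-hpc
    HostName hpc-gw1
    User USERNAME
    ProxyJump swc-gateway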

  3. See what nodes are available using sinfo.

N.B. Adding the following line to your .bashrc and then typing sinfol will print much more information.

alias sinfol='sinfo --Node --format="%.14N %.4D %5P %.6T %.4c %.10C %4O %8e %.8z %.6m %.8d %.6w %.5f %.6E %.30G"'
  4. See what jobs are running with squeue.
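
Some useful variations (standard squeue options; JOBID is a placeholder):

squeue -u $USER     # only your own jobs
squeue -p gpu       # jobs on a particular partition
squeue -j JOBID     # a specific job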

  5. See what data storage is available:

  • cd /ceph/store - general lab data storage
  • cd /nfs/winstor - older lab data
  • cd /ceph/neuroinformatics/neuroinformatics - team storage
  • cd /ceph/zoo - specific project storage
  • cd /ceph/scratch - short-term storage (e.g. for intermediate analysis results); not backed up
  • cd /ceph/scratch/neuroinformatics-dropoff - "dropbox" for others to share data with the team
  6. Start an interactive job in pseudoterminal mode (--pty) by requesting a single core from SLURM, the job scheduler:
srun -p cpu --pty bash -i

N.B. Don't run anything intensive on the login nodes.
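
If the default single core isn't enough, you can ask srun for more resources for the interactive session, e.g. (the values here are purely illustrative):

srun -p cpu -c 4 --mem 8G -t 0-2:00 --pty bash -i   # 4 cores, 8 GB RAM, 2 hours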

  7. Clone the repository containing the demo scripts
cd ~/
git clone https://github.com/neuroinformatics-unit/slurm-demo
  8. Check out the list of available modules
module avail
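
A few other module subcommands that come in handy (standard environment-modules/Lmod commands):

module list               # show currently loaded modules
module unload miniconda   # unload a single module
module purge              # unload all loaded modules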
  9. Load the miniconda module
module load miniconda
  10. Create the conda environment
cd slurm-demo
conda env create -f env.yml
  11. Activate the conda environment and run the Python script
conda activate slurm_demo
python multiply.py 5 10 --jazzy
  12. Stop the interactive job
exit
  13. Check out the batch script:
cat batch_example.sh
#!/bin/bash

#SBATCH -p gpu # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 2G # memory pool for all cores
#SBATCH -n 2 # number of cores
#SBATCH -t 0-0:10 # time (D-HH:MM)
#SBATCH -o slurm_output.out # STDOUT file
#SBATCH -e slurm_error.err # STDERR file
#SBATCH --mail-type=ALL # email notifications for all job events
#SBATCH --mail-user=adam.tyson@ucl.ac.uk

module load miniconda
conda activate slurm_demo

for i in {1..5}
do
  echo "Multiplying $i by 10"
  python multiply.py $i 10 --jazzy
done

Run batch job:

sbatch batch_example.sh
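
After submitting, you can keep an eye on the job and inspect its output, for example (JOBID is the ID printed by sbatch; sacct assumes job accounting is enabled on the cluster):

squeue -u $USER                                      # is the job pending or running?
sacct -j JOBID --format=JobID,State,Elapsed,MaxRSS   # state and resource usage
cat slurm_output.out                                 # stdout written by the script
cat slurm_error.err                                  # stderr written by the script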
  14. Check out the array script:
cat array_example.sh
#!/bin/bash

#SBATCH -p gpu # partition (queue)
#SBATCH -N 1   # number of nodes
#SBATCH --mem 2G # memory pool for all cores
#SBATCH -n 2 # number of cores
#SBATCH -t 0-0:10 # time (D-HH:MM)
#SBATCH -o slurm_array_%A-%a.out # STDOUT file (%A = array job ID, %a = array task index)
#SBATCH -e slurm_array_%A-%a.err # STDERR file
#SBATCH --mail-type=ALL # email notifications for all job events
#SBATCH --mail-user=adam.tyson@ucl.ac.uk
#SBATCH --array=0-9%4

# The array job runs 10 separate tasks, but no more than 4 at a time.
# This is flexible and the array ID ($SLURM_ARRAY_TASK_ID) can be used in any way.

module load miniconda
conda activate slurm_demo

echo "Multiplying $SLURM_ARRAY_TASK_ID by 10"
python multiply.py $SLURM_ARRAY_TASK_ID 10 --jazzy

Run array job:

sbatch array_example.sh
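
As the comment in the script says, $SLURM_ARRAY_TASK_ID can be used however you like. One common pattern is to use it to pick an input file to process; inside an array batch script that might look like the following sketch (the data directory and process.py are hypothetical):

FILES=(/ceph/scratch/my_dataset/*.csv)    # bash array of input files (hypothetical path)
FILE=${FILES[$SLURM_ARRAY_TASK_ID]}       # each array task gets one file
echo "Processing $FILE"
python process.py "$FILE"                 # process.py is a hypothetical analysis script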
  15. Start an interactive job with one GPU:
srun -p gpu --gres=gpu:1 --pty bash -i
  16. Load CUDA
module load cuda
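
Before starting Python, you can quickly confirm that the job can see a GPU with:

nvidia-smi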
  17. Activate the conda environment and check that the GPU is visible
conda activate slurm_demo
python
import tensorflow as tf
tf.config.list_physical_devices('GPU')
  18. For fast I/O, consider copying data to /tmp (fast NVMe storage) as part of the run; /tmp is available on all of the gpu-380 and gpu-sr670 nodes.
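
A typical batch-script pattern (a sketch only; the dataset path, analyse.py, and its --output flag are hypothetical) is to stage data into /tmp, work there, then copy results back to /ceph and clean up:

SCRATCH_DIR=/tmp/$USER/$SLURM_JOB_ID                                        # node-local working directory
mkdir -p "$SCRATCH_DIR"
cp -r /ceph/scratch/my_dataset "$SCRATCH_DIR"/                              # stage input data onto the node
python analyse.py "$SCRATCH_DIR"/my_dataset --output "$SCRATCH_DIR"/results # work against the local copy
cp -r "$SCRATCH_DIR"/results /ceph/scratch/my_results                       # copy results back to shared storage
rm -rf "$SCRATCH_DIR"                                                       # free up /tmp for other users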