This repository showcases how to run deep learning code on NeSI.
After logging in to NeSI, clone the repository:
git clone https://github.com/nesi/dl_demos
then create the Conda environment using:
cd dl_demos
module purge && module load Miniconda3
source $(conda info --base)/etc/profile.d/conda.sh
export PYTHONNOUSERSITE=1
conda env create -f environment.yml -p ./venv
Submit a Slurm job to train a CNN model on CIFAR-10 using
sbatch train_model.sl
and check that the job is running with squeue --me.
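The submit-and-check step can also be scripted. The following is a minimal sketch, not part of the repository: the helper name job_id_from_sbatch is hypothetical, and it assumes sbatch prints either its default "Submitted batch job N" message or the "N" / "N;cluster" form produced by sbatch --parsable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the numeric job ID from sbatch output.
# Handles the default "Submitted batch job 123" message as well as the
# "123" or "123;cluster" form printed by `sbatch --parsable`.
job_id_from_sbatch() {
    local out="$1"
    out="${out##* }"      # keep only the last space-separated word
    echo "${out%%;*}"     # drop an optional ";cluster" suffix
}

# Only submit when Slurm and the job script are actually available.
if command -v sbatch >/dev/null 2>&1 && [ -f train_model.sl ]; then
    job_id="$(job_id_from_sbatch "$(sbatch train_model.sl)")"
    echo "Submitted job ${job_id}"
    # Show only this job in the queue listing.
    squeue --me --jobs="${job_id}"
fi
```

Keeping the job ID in a variable is convenient later, since the results folder name used by TensorBoard contains it.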
While the job is running, log in to its compute node with ssh NODE, where NODE is the node name reported by squeue --me, and run nvidia-smi to check that the script is using the GPU.
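The node lookup and GPU check can be combined in one step. This is a rough sketch under two assumptions: the job runs on a single node (so squeue prints a plain node name rather than a range expression), and the helper name first_node is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first non-blank entry from a list of
# node names read on standard input (one per line).
first_node() {
    awk 'NF { print $1; exit }'
}

if command -v squeue >/dev/null 2>&1; then
    # %N prints each job's node list; --noheader drops the column title.
    node="$(squeue --me --states=RUNNING --noheader --format=%N | first_node)"
    if [ -n "${node}" ]; then
        # Report GPU name, utilisation and memory use as CSV.
        ssh "${node}" nvidia-smi \
            --query-gpu=name,utilization.gpu,memory.used --format=csv
    fi
fi
```

A non-zero utilisation or memory figure while the training script is running suggests the GPU is in use.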
From a Jupyter session, start a TensorBoard instance pointing at the results folder, replacing JOBID with the Slurm job ID:
module purge && module load Miniconda3
source $(conda info --base)/etc/profile.d/conda.sh
export PYTHONNOUSERSITE=1
conda activate ./venv
tensorboard --logdir=results/JOBID-train_model.sl/logs
and use the Jupyter Server Proxy to access it at https://jupyter.nesi.org.nz/user-redirect/proxy/6006/
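The port number in the URL matches the port TensorBoard listens on (6006 by default). As an illustration, the sketch below builds the proxy URL for an arbitrary port; the function name proxy_url is hypothetical, and it assumes the same /user-redirect/proxy/PORT/ pattern applies to other ports.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the Jupyter Server Proxy URL for a service
# listening on a local port of the Jupyter session.
proxy_url() {
    local port="${1:-6006}"   # 6006 is TensorBoard's default port
    echo "https://jupyter.nesi.org.nz/user-redirect/proxy/${port}/"
}

# Example: if 6006 is already taken, TensorBoard could be started with
#   tensorboard --logdir=results/JOBID-train_model.sl/logs --port=6007
# and reached at the address printed here.
proxy_url 6007
```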
Alternatively, start an interactive session on a GPU node (replace nesi99999 with your own project code):
srun --account=nesi99999 --time=10 --cpus-per-task=2 --mem=8GB \
--gpus-per-node=A100-1g.5gb:1 --pty bash
then, on the compute node, load the modules and activate the Conda environment:
module purge
module load Miniconda3/22.11.1-1 cuDNN/8.1.1.33-CUDA-11.2.0
source $(conda info --base)/etc/profile.d/conda.sh
export PYTHONNOUSERSITE=1
conda deactivate
conda activate ./venv
and run the script on the node:
python train_model.py results/scratch
Note: Add --qos=debug when using sbatch or srun to get higher priority for short-lived, low-resource jobs.
To use the notebook, create a Jupyter kernel that uses the Conda environment and loads the CUDA/cuDNN modules:
module purge && module load JupyterLab
nesi-add-kernel -p ./venv -- dl_demos cuDNN/8.1.1.33-CUDA-11.2.0
then (re)start a Jupyter session, making sure to select a GPU, and run the train_model.ipynb notebook.