
How to get the combination of GPU, Apptainer/Singularity and PyTorch working on DelftBlue

Running on GPUs and using Apptainer on DelftBlue:

Apptainer general usage and GPUs:

Docker container:

PyTorch example:

Steps on DelftBlue

Move to your scratch directory, because the image and its build files take up a lot of space. Then make a temporary directory for Apptainer to use when building a .sif file from a Docker image.

cd /scratch/$USER
mkdir tmpdir

Set up Apptainer environment variables.

export APPTAINER_TMPDIR=/scratch/$USER/tmpdir
export APPTAINER_CACHEDIR=/scratch/$USER/.apptainer/cache

Adding these two export lines to the end of ~/.bashrc sets them automatically in every new shell.
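One way to add the two lines without duplicating them when the command is run more than once is sketched below. The helper name append_once is hypothetical; it is not part of Apptainer or of this repository.

```shell
# append_once LINE FILE: append LINE to FILE unless an identical line is already there.
# (Hypothetical helper, for illustration only.)
append_once() {
    touch "$2"
    grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}
```

It could be called as append_once 'export APPTAINER_TMPDIR=/scratch/$USER/tmpdir' ~/.bashrc, and likewise for APPTAINER_CACHEDIR. The single quotes keep $USER unexpanded in the file, so it is resolved each time the shell starts.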

Clone this repository and move into it:

git clone git@github.com:sebranchett/delftblue-gpu-apptainer-pytorch.git
cd delftblue-gpu-apptainer-pytorch

Find the version of the image you want to use. In my case it was pytorch:22.10-py3. Pull the (Docker) image with Apptainer.

apptainer pull docker://nvcr.io/nvidia/pytorch:22.10-py3

For me this took nearly 2 hours and produced the file pytorch_22.10-py3.sif.
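Before submitting a job, it can be worth checking that PyTorch inside the image can see a GPU. The sketch below assumes it is run on a node that has an NVIDIA GPU; the function name check_torch_gpu is hypothetical.

```shell
# check_torch_gpu IMAGE: print the PyTorch version and whether CUDA is visible
# inside the container. (Hypothetical helper name.)
# --nv asks Apptainer to make the host's NVIDIA driver and GPUs available.
check_torch_gpu() {
    apptainer exec --nv "$1" \
        python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
}
```

For example: check_torch_gpu pytorch_22.10-py3.sif. On a node without a GPU, the second value should be False.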

Find your account on DelftBlue:

sacctmgr list -sp user $USER
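The parsable output is pipe-separated and can be hard to read. A minimal sketch for pulling out just the account names follows. It assumes the first line is a header row with a column named "Account"; the function name account_names is hypothetical.

```shell
# account_names: read `sacctmgr -p` output on stdin and print the unique
# values of the column whose header is "Account".
# Assumption: the first line is a header row and one column is named "Account".
account_names() {
    awk -F'|' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Account") col = i; next }
        col && $col != "" { print $col }
    ' | sort -u
}
```

Usage would be: sacctmgr list -sp user $USER | account_names. If no column is named "Account", nothing is printed.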

You will probably have access to 'innovation' and your departmental account. In the quickstart.sh file, edit the line:

#SBATCH --account=research-uco-ict

to an account you have access to.
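If you would rather not open an editor, the same change can be made with sed. This is only a sketch: the function name set_slurm_account is hypothetical, and it assumes the account name contains no "/" characters.

```shell
# set_slurm_account ACCOUNT SCRIPT: rewrite the "#SBATCH --account=..." line in SCRIPT.
# (Hypothetical helper; a backup of the original is kept as SCRIPT.bak.)
set_slurm_account() {
    sed -i.bak "s/^#SBATCH --account=.*/#SBATCH --account=$1/" "$2"
}
```

For example: set_slurm_account innovation quickstart.sh.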

Submit the job:

sbatch quickstart.sh
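To keep hold of the job number, which also appears in the name of the slurm-nnnnnnn.out file, sbatch can be asked for machine-readable output. A minimal sketch, with the hypothetical function name submit_job:

```shell
# submit_job SCRIPT: submit SCRIPT and print only the numeric job ID.
# sbatch --parsable prints "jobid" or "jobid;cluster", so strip anything after ';'.
submit_job() {
    id=$(sbatch --parsable "$1") || return 1
    printf '%s\n' "${id%%;*}"
}
```

With jobid=$(submit_job quickstart.sh), the command squeue -j "$jobid" shows whether the job is still pending or running.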

Once it has started, the job takes only a couple of minutes. On completion, the file quickstart.log should end with:

Saved PyTorch Model State to model.pth
Predicted: "Ankle boot", Actual: "Ankle boot"

as described in the PyTorch documentation.
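The check on the end of the log can also be scripted. The sketch below assumes the two lines shown above are the very last lines of the file, with no trailing blank lines; the function name check_quickstart_log is hypothetical.

```shell
# check_quickstart_log LOG: succeed only if LOG ends with the two expected lines.
check_quickstart_log() {
    tail -n 2 "$1" | head -n 1 | grep -qx 'Saved PyTorch Model State to model.pth' &&
    tail -n 1 "$1" | grep -q '^Predicted: ".*", Actual: ".*"$'
}
```

For example: check_quickstart_log quickstart.log && echo "quickstart finished as expected".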

Please help

The slurm-nnnnnnn.out file ends with these two lines:

13:4: not a valid test operator: (
13:4: not a valid test operator: 520.61.05

If anyone knows why, please let me know.

If you have corrections or improvements to this repository, please contribute an issue or a pull request. It would be much appreciated.