Multinode NLP Training with PyTorch NGC Container

Setup

1. Container image

Build from a Dockerfile and push to Docker Hub

Dockerfile sample: https://github.com/tlkh/pytorch-ngc-multinode-nlp/blob/main/Dockerfile

This Dockerfile uses a PyTorch NGC image as a base, then performs the following steps:

Install Open MPI
Configure SSH
Install Horovod (optional)
Install deepspeed (required)
Install HuggingFace transformers library (and some other optional extras)

Use singularity pull (with a docker:// URI) to pull the Docker container and convert it to a Singularity image

2. PBS Script

Set up the resource request
Set the environment variables MASTER_ADDR and PATH (required for PyTorch to function correctly in some environments when the user is remapped inside the container)
Choose the Singularity image
Run the mpirun command with the required parameters
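
As a rough illustration of where MASTER_ADDR and the process count can come from, the sketch below reads the file named by $PBS_NODEFILE. It assumes the file lists one hostname per line, repeated once per allocated slot; whether the number of lines equals the number of GPUs depends on how the resource request is written. The function name summarize_nodefile is hypothetical and does not appear in the sample PBS script.

```python
import os
from collections import Counter


def summarize_nodefile(path):
    """Return (master_addr, slots_per_host) for a PBS-style node file.

    Assumption: one hostname per line, repeated once per allocated slot.
    """
    with open(path) as f:
        hosts = [line.strip() for line in f if line.strip()]
    if not hosts:
        raise ValueError(f"no hosts found in {path}")
    # The first listed host is used as the rendezvous address (an assumption).
    return hosts[0], Counter(hosts)


if __name__ == "__main__" and "PBS_NODEFILE" in os.environ:
    master, slots = summarize_nodefile(os.environ["PBS_NODEFILE"])
    print("MASTER_ADDR =", master)
    print("total slots =", sum(slots.values()))
```

Run inside a job, this prints a candidate value for MASTER_ADDR and a slot total that can be compared against the value passed to mpirun -n.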

Reference for mpirun:

mpirun \
   -n NUM_OF_GPU --hostfile $PBS_NODEFILE \
   --mca pml ob1 --mca btl tcp,self,vader \
   --mca btl_tcp_if_include bond0 \
   --mca btl_openib_warn_default_gid_prefix 0 \
   -bind-to none -map-by slot \
   -x NCCL_IB_GID_INDEX=3 \
   -x NCCL_CHECKS_DISABLE=1 -x NCCL_IB_DISABLE=0 \
   -x NCCL_IB_HCA=mlx5_bond_0 -x NCCL_IB_CUDA_SUPPORT=1 \
   /app/singularity/3.5.3/bin/singularity exec --nv $image \
   python SCRIPT.py

PBS script sample: https://github.com/tlkh/pytorch-ngc-multinode-nlp/blob/main/nlp.qsub

NLP Training Script

HuggingFace training

See https://github.com/tlkh/pytorch-ngc-multinode-nlp/blob/main/nlptest.py for an example of NLP training (text classification) with the HuggingFace library. When using HuggingFace's Trainer class, you do not need to specify additional arguments to enable multi-GPU or multi-node training.
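
Although the Trainer class needs no extra arguments, each process launched by mpirun still has to learn its rank and the world size. The sketch below shows one possible way to do that: it copies the rank variables that Open MPI sets for each process into the names read by PyTorch's environment-variable initialization. This is an assumption for illustration; the linked script may rely on a different mechanism. The function name export_torch_env_from_ompi and the default port are hypothetical.

```python
import os


def export_torch_env_from_ompi(env=os.environ, master_port="29500"):
    """Copy Open MPI rank info into the variables PyTorch's env:// init reads."""
    mapping = {
        "OMPI_COMM_WORLD_RANK": "RANK",
        "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
        "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
    }
    for src, dst in mapping.items():
        if src in env:
            # Values that are already set are left untouched.
            env.setdefault(dst, env[src])
    # MASTER_ADDR is expected to come from the PBS script;
    # the port value here is an assumed default.
    env.setdefault("MASTER_PORT", master_port)
    keys = ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_PORT")
    return {k: env[k] for k in keys if k in env}
```

Calling this at the top of a training script, before any distributed setup, would leave existing values alone and only fill in the ones that are missing.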

Note that for convenience, the script will download the dataset and model weights the first time it is run. If you are testing this, please run it as a single-GPU job first to download the dataset and model weights.

Note: Horovod is not used here.