N-Ways to Multi-GPU Programming

This repository contains mini applications for GPU Bootcamps. This bootcamp focuses on multi-GPU programming models.

Scaling applications to multiple GPUs across multiple nodes requires more than familiarity with programming models and optimization techniques. It also requires root-cause analysis with in-depth profiling to identify and minimize bottlenecks. In this bootcamp, participants will learn to improve the performance of an application step by step, taking cues from profilers along the way. An understanding of the underlying technologies and communication topology will also help participants use high-performance NVIDIA libraries to extract more performance from the system.

Bootcamp Outline

  • Overview of single-GPU code and Nsight Systems Profiler
  • Single Node Multi-GPU:
    • CUDA Memcpy and Peer-to-Peer Memory Access
    • Intra-node topology
    • CUDA Streams and Events
  • Multi-Node Multi-GPU:
    • Introduction to MPI and Multi-Node execution overview
    • MPI with CUDA Memcpy
    • CUDA-aware MPI
    • Supplemental: Configuring MPI in a containerized environment
  • NVIDIA Collective Communications Library (NCCL)
  • NVSHMEM Library

Prerequisites

This bootcamp requires a multi-node system with multiple GPUs in each node (at least 2 GPUs per node).

Tutorial Duration

The bootcamp material takes approximately 8 hours to complete.

Using NVIDIA HPC SDK

A multi-node installation of the NVIDIA HPC SDK is recommended. Refer to the NVIDIA HPC SDK Installation Guide for detailed instructions. Ensure that your installation contains HPCX with UCX.

After installation, add the HPC SDK to the environment as follows (the paths below are for HPC SDK 21.5):

# Add HPC-SDK to PATH:
export PATH="<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/compilers/bin:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/bin:$PATH"
# Add HPC-SDK to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH="<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/nvshmem/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/nccl/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/mpi/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/math_libs/lib64:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/compilers/lib:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/extras/CUPTI/lib64:<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/lib64:$LD_LIBRARY_PATH"
# Set the CUDA and NVSHMEM home directories:
export CUDA_HOME=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda
export NVSHMEM_HOME=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/nvshmem
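
Because the PATH and LD_LIBRARY_PATH values above are long, a mistyped placeholder is easy to miss. The following is a minimal sketch of a sanity check, not part of the bootcamp material: the helper name missing_path_entries is hypothetical, and the sketch assumes a POSIX-compatible shell such as bash. It prints every entry of a colon-separated list that is not an existing directory.

```shell
# Hypothetical helper (not part of the bootcamp): print each entry of a
# colon-separated path list that does not exist as a directory.
# The function body runs in a subshell so the IFS change stays local.
missing_path_entries() (
    IFS=:
    for entry in $1; do
        # Skip empty entries and report anything that is not a directory
        if [ -n "$entry" ] && [ ! -d "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
)
```

For example, missing_path_entries "$LD_LIBRARY_PATH" prints nothing when every listed directory exists. Entries inherited from the previous environment are checked as well, so some reported paths may be unrelated to the HPC SDK.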

Then, install OpenMPI as follows. If you do not use the Slurm workload manager, remove the --with-slurm flag from the configure command:

# Download and extract OpenMPI Tarfile
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
tar -xvzf openmpi-4.1.1.tar.gz
cd openmpi-4.1.1/
mkdir -p build
# Configure OpenMPI
./configure --prefix=$PWD/build --with-libevent=internal --with-xpmem --with-cuda=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/cuda/ --with-slurm --enable-mpi1-compatibility --with-verbs --with-hcoll=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/hpcx/hpcx-2.8.1/hcoll/lib --with-ucx=<path-to-nvidia-hpc-sdk>/Linux_x86_64/21.5/comm_libs/hpcx/hpcx-2.8.1/ucx/
# Install OpenMPI
make all install

Now, add OpenMPI to the environment:

export PATH="<path-to-openmpi>/build/bin/:$PATH"
export LD_LIBRARY_PATH="<path-to-openmpi>/build/lib:$LD_LIBRARY_PATH"

Ensure that the custom-built OpenMPI is in use by running which mpirun, which should point to the mpirun binary in the <path-to-openmpi>/build/bin directory.
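
The check above can be scripted. This is a minimal sketch under stated assumptions: the function names mpirun_is_under and ompi_cuda_support are hypothetical, and a POSIX-compatible shell is assumed. The second function wraps the ompi_info query that the Open MPI documentation describes for checking whether a build is CUDA-aware.

```shell
# Hypothetical helper: succeed only if the first mpirun found on PATH
# lives directly under the given directory.
mpirun_is_under() {
    prefix=${1%/}                       # tolerate a trailing slash
    found=$(command -v mpirun) || return 1
    case "$found" in
        "$prefix"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Hypothetical helper: print Open MPI's own report on CUDA support.
# A CUDA-aware build prints a line ending in "value:true".
ompi_cuda_support() {
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
}
```

For example, mpirun_is_under "<path-to-openmpi>/build/bin" && echo "custom OpenMPI in use" confirms the PATH order, and ompi_cuda_support can then be run to confirm that the --with-cuda option took effect.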

Without Using NVIDIA HPC SDK

Multi-node compatible versions of the following are required:

Testing

All code was tested with CUDA driver 460.32.03, CUDA 11.3.0.0, OpenMPI 4.1.1, HPCX 2.8.1, Singularity 3.6.1, NCCL 2.9.9.1, and NVSHMEM 2.1.2. Note that OpenMPI in our cluster was compiled with CUDA, HCOLL, and UCX support.

Running Jupyter Lab

As this bootcamp covers multi-node CUDA-aware MPI concepts, it is primarily designed to run without any containers. After the prerequisite software has been installed, follow these steps to install and run Jupyter Lab:

# Install Miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p <my_dir>
# Add conda to PATH
export PATH=$PATH:<my_dir>/bin/
# Install Jupyter Lab
conda install -c conda-forge jupyterlab
# Run Jupyter Lab
jupyter lab --notebook-dir=<path-to-gpubootcamp-repo>/hpc/multi_gpu_nways/labs/ --port=8000 --ip=0.0.0.0 --no-browser --NotebookApp.token=""

After starting Jupyter Lab, open http://localhost:8000 (the port passed via --port above) in a web browser and start the introduction.ipynb notebook.

Optional: Containerized Build with Singularity

This material is designed to primarily run in containerless environments, that is, directly on the cluster. Thus, building the Singularity container is OPTIONAL.

If containerization is desired, follow the steps outlined in the notebook MPI in Containerized Environments.

Follow the steps below to build the Singularity container image and run Jupyter Lab:

# Build the container
singularity build multi_gpu_nways.simg Singularity
# Run Jupyter Lab
singularity run --nv multi_gpu_nways.simg jupyter lab --notebook-dir=<path-to-gpubootcamp-repo>/hpc/multi_gpu_nways/labs/ --port=8000 --ip=0.0.0.0 --no-browser --NotebookApp.token="" 

Then, access Jupyter Lab at http://localhost:8000 (the port passed via --port above).

Known issues

Compiler throws errors

If compiling any program throws an error about CUDA, NCCL, NVSHMEM, or MPI libraries or header files not being found, ensure that LD_LIBRARY_PATH is set correctly. Also make sure that the environment variables CUDA_HOME, NCCL_HOME, and NVSHMEM_HOME are set, either during installation or manually inside each Makefile.
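
One way to narrow down such errors is to confirm that each home variable is set and points at a directory that contains the expected header. The following is a minimal sketch, assuming a POSIX-compatible shell. The helper name check_home is hypothetical, and the include/<header> layout (cuda_runtime.h, nccl.h, nvshmem.h) is an assumption about a typical installation that may differ on your system.

```shell
# Hypothetical helper: check that the variable named in $1 is set and that
# <its value>/include/$2 exists. The include/ layout is an assumption.
check_home() {
    eval "home=\${$1:-}"                # read the variable whose name is in $1
    if [ -z "$home" ]; then
        echo "$1 is not set"
        return 1
    fi
    if [ ! -f "$home/include/$2" ]; then
        echo "$1=$home has no include/$2"
        return 1
    fi
    echo "$1 OK"
}
```

For example, run check_home CUDA_HOME cuda_runtime.h, check_home NCCL_HOME nccl.h, and check_home NVSHMEM_HOME nvshmem.h before invoking make. Any message other than "OK" indicates which variable to fix.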

  • Please go through the list of existing bugs/issues or file a new issue on GitHub.

Questions?

Please join the OpenACC Slack channel to raise questions.

If you observe any errors or issues, please file an issue on the GPUBootcamp GitHub repository.