NVSHMEM is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams. NVSHMEM offers a compelling multi-GPU programming model for many application use cases, and it is especially valuable on modern GPU servers that have a high density of GPUs per node and complex interconnects, such as NVIDIA NVSwitch on the NVIDIA DGX A100 server.
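As a minimal sketch of these three access modes (not a complete program; the names device_put, host_puts, sym_buf, and local_src are illustrative), a buffer on the symmetric heap can be updated from a CUDA kernel, from host code, or as a stream-ordered operation:

#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// GPU-initiated: a fine-grained single-element put issued from device code.
__global__ void device_put(int *sym_buf, int value, int peer)
{
    nvshmem_int_p(sym_buf, value, peer);
}

// Host-side variants: sym_buf was allocated with nvshmem_malloc,
// and local_src is a device-resident source buffer.
void host_puts(int *sym_buf, const int *local_src, int peer, cudaStream_t stream)
{
    // CPU-initiated: blocking put issued from the host.
    nvshmem_int_put(sym_buf, local_src, 1, peer);

    // Stream-ordered: the put is enqueued on the CUDA stream and runs in stream order.
    nvshmemx_int_put_on_stream(sym_buf, local_src, 1, peer, stream);
}

The complete example below uses the device-initiated form together with a stream-ordered barrier.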
~$ echo $NVSHMEM_HOME
/opt/share/ucx/1.12.0-cuda-11.6-ofed-5.4/
~$ module load nvshmem/2.8.0
~$ module list
Currently Loaded Modulefiles:
  1) gcc/11.1.0   3) ucx/1.13.1-cuda-12.0          5) nvshmem/2.8.0
  2) cuda/12.0    4) openmpi/4.1.4-gcc-cuda-12.0
#include <stdio.h>
#include <cuda.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its own rank into the symmetric destination buffer on its right neighbor.
__global__ void simple_shift(int *destination)
{
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    nvshmem_int_p(destination, mype, peer);
}

int main(int argc, char **argv)
{
    int mype_node, msg;
    cudaStream_t stream;

    nvshmem_init();
    // Select the GPU that matches this PE's rank within the node.
    mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);
    cudaStreamCreate(&stream);

    // Allocate one int on the symmetric heap.
    int *destination = (int *) nvshmem_malloc(sizeof(int));

    simple_shift<<<1, 1, 0, stream>>>(destination);
    // Stream-ordered barrier across all PEs.
    nvshmemx_barrier_all_on_stream(stream);
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("%d: received message %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
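Because destination is allocated with nvshmem_malloc, it resides on the symmetric heap and can be written remotely by any PE. The nvshmemx_barrier_all_on_stream call orders the work on the stream so that every PE's put has completed before the cudaMemcpyAsync copies the received value back to the host. The example can then be compiled with nvcc and launched with the nvshmrun launcher: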
~$ nvcc -rdc=true -ccbin g++ -gencode=arch=compute_70,code=sm_70 -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib helloWorld_nvshmem.cu -o helloWorld_nvshmem -lnvidia-ml -lcudart -lnvshmem -lcuda
~$ nvshmrun -n 4 ./helloWorld_nvshmem
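Each PE receives the rank of its left neighbor, so a run with 4 PEs should print something like the following (the order of the lines can vary from run to run):

0: received message 3
1: received message 0
2: received message 1
3: received message 2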
NVSHMEM requires at least CUDA 12.0 and Volta or newer GPUs. On platforms that use InfiniBand GPUDirect Async (IBGDA), best performance is achieved when each GPU and its NIC are attached to the same socket in multi-socket configurations.