Implement two collective operations, Reduce and Broadcast, within a single node:
int intraReduce(const int* sendbuff, int* recvbuff, size_t count, int root, ncclComm_t comm, cudaStream_t stream);
int intraBroadcast(const int* sendbuff, int* recvbuff, size_t count, int root, ncclComm_t comm, cudaStream_t stream);
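As a rough illustration of the semantics these two collectives are expected to have, here is a minimal host-only reference sketch that could be used to check results. It is not the implementation in this repository. The function names refReduceSum and refBroadcast are hypothetical, the flattened buffer layout (rank r's data at offset r * count) is chosen only for the example, and the reduction operator is assumed to be an element-wise sum, since the signatures above do not take an operator argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference for Reduce (assumed operator: sum).
 * sendAll holds nranks contiguous buffers of `count` ints each;
 * rank r's send buffer starts at sendAll + r * count.
 * rootRecv receives the element-wise sum, as the root rank would. */
void refReduceSum(const int* sendAll, int* rootRecv, size_t count, int nranks)
{
    assert(sendAll != NULL && rootRecv != NULL && nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        int acc = 0;
        for (int r = 0; r < nranks; ++r) {
            acc += sendAll[(size_t)r * count + i];
        }
        rootRecv[i] = acc;
    }
}

/* Hypothetical reference for Broadcast.
 * rootSend is the root rank's send buffer; recvAll holds nranks
 * contiguous receive buffers, each of which ends up as a copy of it. */
void refBroadcast(const int* rootSend, int* recvAll, size_t count, int nranks)
{
    assert(rootSend != NULL && recvAll != NULL && nranks > 0);
    for (int r = 0; r < nranks; ++r) {
        memcpy(recvAll + (size_t)r * count, rootSend, count * sizeof(int));
    }
}
```

With 4 ranks, a test could copy each rank's device buffer back to the host, pack the buffers into one array, and compare the root's recvbuff against refReduceSum, or every rank's recvbuff against the root's sendbuff for the broadcast case.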
To build the tests, run make.
If CUDA is not installed in /usr/local/cuda, you may specify CUDA_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL_HOME.
The tests currently rely on MPI. If MPI is not installed in /usr/local/mpi, you may specify MPI_HOME.
$ make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl MPI_HOME=/path/to/mpi
To run on a single node with 4 GPUs:
$ mpirun -np 4 ./build/intra