RCCL
ROCm Communication Collectives Library
Introduction
RCCL (rickle) is implementation of MPI communication apis on ROCm enabled GPUs. It is a collective communication library whose aim is to provide low-latency and high-bandwidth communication on dense GPU systems. RCCL launches special-purpose compute kernels for parallel overlapping transfers. This involves distributed processing and exchanging data between participating peer-accessible GPUs in a logical ring within a single multi-GPU node.
Supported APIs
- AllReduce
- Broadcast
Requirements
- ROCm supported GPUs
- ROCm stack installed on the system (HCC)
Build
RCCL directly depends on HIP runtime & HCC C++ compiler which are part of the ROCm SW stack.
git clone https://github.com/ROCmSoftwarePlatform/rccl.git
cd rccl/src
make lib
sudo make install
RCCL install requires sudo/root access because it creates a directory called "rccl" under /opt/rocm/. This is an optional step and RCCL can be used directly by including the path containing librccl.so.
Run
rccl
library install directory should be added to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/rocm/rccl/lib:$LD_LIBRARY_PATH
Usage
#include <rccl.h>
#include <vector>
int main() {
int numGpus;
hipGetDeviceCount(&numGpus);
std::vector<rcclComm_t> comms(numGpus);
rcclCommInitAll(comms, numGpus);
std::vector<float*> sendBuff(numGpus);
std::vector<float*> recvBuff(numGpus);
std::vector<hipStream_t> streams(numGpus);
// Set up sendBuff and recvBuff on each GPU
// Create stream on each GPU
for(int i=0;i<numGpus;i++) {
hipSetDevice(i);
rcclAllReduce(sendBuff[i], recvBuff[i], size, rcclFloat,
rcclSum, comms[i], streams[i]);
}
}
Source Layout
inc
- contains the public RCCL header exposing the RCCL interfacessrc
- contains source code for the implementation of the RCCL APIstests
- contains unit tests cases to validate RCCL
Source Naming
The RCCL library consists of two primary layers:
Interface layer
- rccl.h - C99 APIs as defined by the RCCL library.
- rccl.cpp - The interface layer implementation encapsulates the functionality by invoking the actual primitive specific C++ template functions.
RCCL primitive specific implementations and kernels
- rcclDataTypes.h
- rcclTracker.h
- rccl{Primitive}Runtime.h
- rccl{Primitive}Kernels.h
- rcclKernelHelper.h
This communication primitive layer contains implementations the broadcast and all-reduce distributed designs. It takes into account of the number of GPUs in the system, the data size and divide the work into smaller chunks. The compute kernels are launched to implement a distributed pipelined operation that achieves overall maximum throughput when compared to a single root processing and sharing the data serially.
Caveats
The initial implementation of the distributed broadcast and all-reduce designs are focused on the functionality and correctness and not tuned yet to obtain optimal performance for a specific input size and GPU count. Better strategies to determine optimal chunk sizes to allow overlapping of transfers for better pipelining are being explored.