This project is a proof of concept for running ad-hoc NCCL cliques within an existing Dask cluster.
It provides a full end-to-end example that:
- accepts a Dask cuDF as input
- calls down to a mock C++ "cuML algorithm" through Cython
- performs a collective reduce across ranks
- returns a Dask cuDF with the results of the reduction
The included notebook (`demo.ipynb`) walks through a simple example of creating a class that wraps the maintenance logic for a `NcclClique` and performs the end-to-end demonstration at the end.
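As a rough orientation, the end-to-end flow the notebook builds up looks like the sketch below. The wrapper class and method names are hypothetical placeholders for what the notebook defines, not this project's actual API:

```python
# Hypothetical sketch of the notebook's end-to-end flow.
# CliqueWrapper and run_reduce are illustrative placeholder names only.
import dask_cudf

def run_demo(client, csv_paths):
    # Input: a Dask cuDF partitioned across the GPU workers
    ddf = dask_cudf.read_csv(csv_paths)

    # Wrap the maintenance logic for an ad-hoc NcclClique over the workers
    clique = CliqueWrapper(client)          # hypothetical wrapper class

    # Call down into the mock C++ "cuML algorithm" through Cython, which
    # performs a collective reduce across ranks and hands back a Dask cuDF
    return clique.run_reduce(ddf)           # hypothetical method
```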
This demo uses `LocalCUDACluster` to establish a cluster of workers with OPG (one-process-per-GPU) access to the GPUs on a single node. Rather than using `LocalCUDACluster`, the Dask client object can instead connect to an existing Dask cluster, such as one with workers that span physical nodes. This will be tested in next steps.
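For reference, here is a minimal sketch of both setups using standard `dask.distributed` and `dask-cuda` calls (the scheduler address below is a placeholder):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Single-node case used by the demo: one worker per GPU on this machine
cluster = LocalCUDACluster()
client = Client(cluster)

# Multi-node case: connect the client to an already-running scheduler instead
# (placeholder address; 8786 is the dask-scheduler default port)
# client = Client("tcp://scheduler-host:8786")
```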
Dask is used to broadcast the NCCL `uniqueId` to the workers so that this proof of concept does not require a dependency on MPI. Next steps will also include demonstrating UCX initialization on Dask workers.
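The broadcast itself only needs standard Dask primitives. Below is a rough sketch of the idea; `nccl_unique_id()` and `init_worker_comms()` are hypothetical stand-ins for this project's Cython bindings, named purely for illustration:

```python
# Sketch: shipping an NCCL uniqueId to every worker via Dask instead of MPI.
# nccl_unique_id() and init_worker_comms() are hypothetical placeholders.
from dask.distributed import Client, wait

def setup_clique(client: Client):
    workers = sorted(client.scheduler_info()["workers"])
    n_ranks = len(workers)

    # Generate the uniqueId once on the client ...
    uid = nccl_unique_id()

    # ... then pin one task to each worker so every rank receives the same id
    futures = [
        client.submit(init_worker_comms, uid, rank, n_ranks,
                      workers=[addr], pure=False)
        for rank, addr in enumerate(workers)
    ]
    wait(futures)
    return futures
```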
Steps to run this demonstration:

1. You will need to have NCCL2 and the CUDA Toolkit installed and available on your library and include paths. You can install NCCL2 in conda with:
   ```
   conda install -c nvidia nccl
   ```
2. You can install the CUDA Toolkit in your conda environment with:
   ```
   conda install cudatoolkit==10.0.130
   ```
3. Check out the branch from the cuML comms pull request and build the C++ source, as outlined in `BUILD.md` inside the cuML codebase.
4. The cuML NCCL Communicator will need to be built and installed. You can find it in the pull request from step #3. The build instructions are outlined in the comments of the cuML comms PR.
5. Set the `CUML_HOME` environment variable to the location of the cuML source code.
6. To build the C++ and Cython portion of the demonstration, run the following in the project root directory:
   ```
   python setup.py install
   ```
7. Install `dask-cuda`. It's easy to install this from source by running `python setup.py install` in the root directory of the repository.
Run the demonstration notebook by executing the `jupyter notebook` or `jupyter lab` command in the project root directory and navigating to the `demo.ipynb` notebook.
Running the demo on multiple nodes requires that the steps above be completed on every host so that all hosts use the same Python, Dask, cuDF, Dask cuDF, NCCL, and cuML versions.
A Dask cluster consists of a single scheduler and some number of workers.
- You can start a Dask scheduler on any of the hosts with the `dask-scheduler` command.
- The `dask-cuda-worker <scheduler_address>` command (from the `dask-cuda` repository) can be run once on each host and will start a single worker for each GPU on that host. You can limit the GPUs used on any host with the `LOCAL_CUDA_DEVICES` environment variable: `LOCAL_CUDA_DEVICES=0,1,2 dask-cuda-worker <scheduler_address>`
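Once the scheduler and `dask-cuda-worker` processes are running on each host, you can check from the demo's client that every GPU worker has joined before building the clique. A small sketch using standard `dask.distributed` calls (the scheduler address is a placeholder):

```python
from dask.distributed import Client

# Placeholder address; use the host/port printed by dask-scheduler at startup
client = Client("tcp://scheduler-host:8786")

# dask-cuda-worker registers one worker per visible GPU, so the count here
# should match the total number of GPUs across all hosts
workers = client.scheduler_info()["workers"]
print(f"{len(workers)} workers connected:")
for address in workers:
    print("  ", address)
```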