This project is a proof of concept for running ad-hoc NCCL cliques within an existing Dask cluster.
It provides a full end-to-end example that:
- accepts a Dask cuDF as input
- calls down to a mock C++ "cuML algorithm" through Cython
- performs a collective reduce across ranks
- returns a Dask cuDF with the results of the reduction
The included notebook (`demo.ipynb`) walks through a simple example of creating a class that wraps the maintenance logic for a `NcclClique` and performs the end-to-end demonstration at the end.
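As a rough orientation, the end-to-end flow the notebook builds up looks like the sketch below. The wrapper class and method names are hypothetical placeholders for what the notebook defines, not this project's actual API:

```python
# Hypothetical sketch of the notebook's end-to-end flow.
# CliqueWrapper and run_reduce are illustrative placeholder names only.
import dask_cudf

def run_demo(client, csv_paths):
    # Input: a Dask cuDF partitioned across the GPU workers
    ddf = dask_cudf.read_csv(csv_paths)

    # Wrap the maintenance logic for an ad-hoc NcclClique over the workers
    clique = CliqueWrapper(client)          # hypothetical wrapper class

    # Call down into the mock C++ "cuML algorithm" through Cython, which
    # performs a collective reduce across ranks and hands back a Dask cuDF
    return clique.run_reduce(ddf)           # hypothetical method
```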
This demo uses `LocalCUDACluster` to establish a cluster of workers with OPG (one-process-per-GPU) access to the GPUs on a single node. Rather than using `LocalCUDACluster`, the Dask client object can instead connect to an existing Dask cluster, such as one with workers that span physical nodes. This will be tested in next steps.
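For reference, here is a minimal sketch of both setups using standard `dask.distributed` and `dask-cuda` calls (the scheduler address below is a placeholder):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Single-node case used by the demo: one worker per GPU on this machine
cluster = LocalCUDACluster()
client = Client(cluster)

# Multi-node case: connect the client to an already-running scheduler instead
# (placeholder address; 8786 is the dask-scheduler default port)
# client = Client("tcp://scheduler-host:8786")
```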
Dask is used to broadcast the NCCL `uniqueId` to the workers so that this proof of concept does not require a dependency on MPI. Next steps will also include demonstrating UCX initialization on Dask workers.
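The broadcast itself only needs standard Dask primitives. Below is a rough sketch of the idea; `nccl_unique_id()` and `init_worker_comms()` are hypothetical stand-ins for this project's Cython bindings, named purely for illustration:

```python
# Sketch: shipping an NCCL uniqueId to every worker via Dask instead of MPI.
# nccl_unique_id() and init_worker_comms() are hypothetical placeholders.
from dask.distributed import Client, wait

def setup_clique(client: Client):
    workers = sorted(client.scheduler_info()["workers"])
    n_ranks = len(workers)

    # Generate the uniqueId once on the client ...
    uid = nccl_unique_id()

    # ... then pin one task to each worker so every rank receives the same id
    futures = [
        client.submit(init_worker_comms, uid, rank, n_ranks,
                      workers=[addr], pure=False)
        for rank, addr in enumerate(workers)
    ]
    wait(futures)
    return futures
```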
Steps to run this demonstration:

1. You will need to have NCCL2 and the CUDA Toolkit installed and available on your library and include paths. You can install NCCL2 in conda with:
   ```
   conda install -c nvidia nccl
   ```
2. You can install the CUDA Toolkit in your conda environment with:
   ```
   conda install cudatoolkit==10.0.130
   ```
3. Check out the branch from the cuML comms pull request and build the C++ source, as outlined in `BUILD.md` inside the cuML codebase.
4. The cuML NCCL Communicator will need to be built and installed. You can find it in the pull request from step #3. The build instructions are outlined in the comments of the cuML comms PR.
5. Set the `CUML_HOME` environment variable to the location of the cuML source code.
6. To build the C++ and Cython portion of the demonstration, run the following in the project root directory:
   ```
   python setup.py install
   ```
7. Install `dask-cuda`. It's easy to install this from source by running `python setup.py install` in the root directory of the repository.
Run the demonstration notebook by executing the `jupyter notebook` or `jupyter lab` command in the project root directory and navigating to the `demo.ipynb` notebook.
Running the demo on multiple nodes requires that the steps above be completed on every host so that all hosts use the same Python, Dask, cuDF, Dask cuDF, NCCL, and cuML versions.
A Dask cluster consists of a single scheduler and some number of workers.
- You can start a Dask scheduler on any of the hosts with the `dask-scheduler` command.
- The `dask-cuda-worker <scheduler_address>` command (from the `dask-cuda` repository) can be run once on each host and will start a single worker for each GPU on that host. You can limit the GPUs used on any host with the `LOCAL_CUDA_DEVICES` environment variable: `LOCAL_CUDA_DEVICES=0,1,2 dask-cuda-worker <scheduler_address>`
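Once the scheduler and `dask-cuda-worker` processes are running on each host, you can check from the demo's client that every GPU worker has joined before building the clique. A small sketch using standard `dask.distributed` calls (the scheduler address is a placeholder):

```python
from dask.distributed import Client

# Placeholder address; use the host/port printed by dask-scheduler at startup
client = Client("tcp://scheduler-host:8786")

# dask-cuda-worker registers one worker per visible GPU, so the count here
# should match the total number of GPUs across all hosts
workers = client.scheduler_info()["workers"]
print(f"{len(workers)} workers connected:")
for address in workers:
    print("  ", address)
```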