Herald is a high-performance distributed deep learning training system for embedding models. It optimizes embedding transmissions between workers and parameter servers through embedding scheduling.
Herald is built on top of Hetu with additional optimizations:
- upgrading to CUDA 11
- adding RDMA support
See CMakeLists.txt for the required packages and libraries. In general, Herald requires:
- cmake 3.18.0
- pybind11 2.6.1
- ZMQ 4.3.2
- OpenMPI 4.0.3
- protobuf 3.15.8
- NCCL 2.8
We also provide a Dockerfile to build the environment for Herald.
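For instance, assuming the Dockerfile sits at the repository root, the environment can be built and entered roughly as follows (the image tag `herald` is our own choice here, not a name fixed by the project):

```bash
# Build the image from the provided Dockerfile (the tag "herald" is arbitrary).
docker build -t herald .
# Start an interactive container with GPU access; requires the NVIDIA container toolkit.
docker run --gpus all -it --rm herald
```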
To build Herald from source:

```bash
mkdir build
cd build
cmake ..
make -j
```
See the simple distributed configuration file here.
Before running the experiment, you need to correctly configure the DMLC and NCCL environment variables and other MPI-related parameters in the runner.py and run_hetu.py files. For example, you need to set DMLC_INTERFACE to the network interface over which the workers and the parameter servers communicate with each other, and NCCL_SOCKET_IFNAME so that NCCL finds the correct network interface.
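As a minimal sketch, the two interface variables mentioned above can be exported before launching; the interface name `eth0` is an assumption and should be replaced with the NIC that actually connects your machines:

```bash
# Hypothetical example: replace eth0 with the interface that connects
# your workers and parameter servers (e.g., an RDMA-capable NIC).
export DMLC_INTERFACE=eth0        # interface used by the parameter-server communication layer
export NCCL_SOCKET_IFNAME=eth0    # interface NCCL uses for its socket/bootstrap traffic
```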
In addition, make sure you have placed the dataset at examples/datasets/ and created the directories for logging. For convenience, we provide some datasets on OneDrive.
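A minimal sketch of this preparation step; the `logs/` path below is an assumption, so check which paths runner.py and run_hetu.py actually expect:

```bash
# Dataset location taken from this README; the log directory name is hypothetical.
mkdir -p examples/datasets   # place the downloaded dataset here
mkdir -p logs                # adjust to whatever path the scripts write logs to
```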
```bash
# run hetu
python3 ./python/runner.py -c examples/config/dist.yml python3 ./examples/ctr/run_hetu.py --model wdl_criteo --comm Hybrid --cache lru --bound 0 --bsp 0 --nepoch 1 --all --batch-size 256 --embedding-size 512 --cache-limit-ratio 0.1

# run laia
python3 ./python/runner.py -c examples/config/dist.yml python3 ./examples/ctr/run_laia.py --model wdl_criteo --comm Hybrid --cache lru --bound 0 --bsp 0 --nepoch 1 --all --batch-size 256 --embedding-size 512 --cache-limit-ratio 0.1
```
The design and implementation of Herald, as well as the experimental results, are documented in a paper accepted for publication at NSDI'24.
```bibtex
@inproceedings{herald,
  title     = {Accelerating Neural Recommendation Training with Embedding Scheduling},
  author    = {Zeng, Chaoliang and Liao, Xudong and Cheng, Xiaodian and Tian, Han and Wan, Xinchen and Wang, Hao and Chen, Kai},
  booktitle = {NSDI},
  year      = {2024}
}
```