CAGNET: Communication-Avoiding Graph Neural nETworks

Description

CAGNET is a family of parallel algorithms for training graph neural networks (GNNs) that can asymptotically reduce communication compared to previous parallel GNN training methods. CAGNET algorithms are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, and are implemented with torch.distributed on GPU-equipped clusters. This repository applies each of these parallel algorithms to a 2-layer graph convolutional network (GCN).
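To make the 1D case concrete, here is a minimal single-process sketch of a block-row sparse-dense multiply. It is not taken from the CAGNET sources: `row_blocks` and `spmm_1d` are hypothetical names, the "ranks" are plain loop indices, and the gather of the dense matrix is a list concatenation standing in for a torch.distributed collective. The actual partitioning and communication schedule in gcn_distr.py may differ.

```python
# Single-process simulation of a 1D block-row sparse-dense multiply (Z = A @ H).
# Illustrative only: no real communication happens here.

def row_blocks(n, p):
    """Split row indices 0..n-1 into p contiguous blocks (trailing blocks may be short or empty)."""
    size = (n + p - 1) // p
    return [range(r * size, min((r + 1) * size, n)) for r in range(p)]

def spmm_1d(coo, H, p):
    """coo: list of (row, col, value) entries of an n x n sparse A; H: n x f dense list of lists."""
    n, f = len(H), len(H[0])
    blocks = row_blocks(n, p)
    # Each simulated rank owns the rows of A and of H that fall in its block.
    local_A = [[(i, j, v) for (i, j, v) in coo if i in blk] for blk in blocks]
    local_H = [[H[i] for i in blk] for blk in blocks]
    # Communication step: every rank needs all rows of H to multiply its rows of A.
    gathered_H = [row for part in local_H for row in part]
    Z = []
    for rank, blk in enumerate(blocks):
        Z_local = {i: [0.0] * f for i in blk}
        for i, j, v in local_A[rank]:
            for k in range(f):
                Z_local[i][k] += v * gathered_H[j][k]
        Z.extend(Z_local[i] for i in blk)
    return Z

# Path graph 0-1-2-3 as a symmetric adjacency matrix in COO form.
coo = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 3, 1.0), (3, 2, 1.0)]
H = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
Z = spmm_1d(coo, H, p=2)
```

The result is the same for any number of simulated ranks; only the amount of data each rank has to receive changes, which is the cost the 1.5D, 2D, and 3D variants aim to reduce.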

For more information, please read our ACM/IEEE SC'20 paper, "Reducing Communication in Graph Neural Network Training".

Contact: Alok Tripathy (alokt@berkeley.edu)

Dependencies

Python 3.6.10
PyTorch 1.3.1
PyTorch Geometric (PyG) 1.3.2
CUDA 10.1.168
GCC 6.4.0

On OLCF Summit, all of these dependencies can be loaded with the following commands:

module load cuda # CUDA 10.1
module load gcc # GCC 6.4.0
module load ibm-wml-ce/1.7.0-3 # PyTorch 1.3.1, Python 3.6.10

# PyG and its dependencies
conda create --name gnn --clone ibm-wml-ce-1.7.0-3
conda activate gnn
pip install --no-cache-dir torch-scatter==1.4.0
pip install --no-cache-dir torch-sparse==0.4.3
pip install --no-cache-dir torch-cluster==1.4.5
pip install --no-cache-dir torch-geometric==1.3.2

Compiling

This code uses C++ extensions. To compile these, run

cd sparse-extension
python setup.py install

Documentation

Each algorithm in CAGNET is implemented in a separate file.

gcn_distr.py : 1D algorithm
gcn_distr_15d.py : 1.5D algorithm
gcn_distr_2d.py : 2D algorithm
gcn_distr_3d.py : 3D algorithm

Each file also has the following flags:

--accperrank <int> : Number of GPUs on each node
--epochs <int> : Number of epochs to run training
--graphname <Reddit/Amazon/subgraph3> : Graph dataset to run training on
--timing <True/False> : Enable timing barriers to time phases in training
--midlayer <int> : Number of activations in the hidden layer
--runcount <int> : Number of times to run training
--normalization <True/False> : Normalize adjacency matrix in preprocessing
--activations <True/False> : Enable activation functions between layers
--accuracy <True/False> : Compute and print accuracy metrics (Reddit only)
--replication <int> : Replication factor (1.5D algorithm only)
--download <True/False> : Download the Reddit dataset

Some of these flags do not currently exist for the 3D algorithm.
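For readers who want to wrap or reuse these flags, the following is a hedged sketch of how such a flag set could be declared with argparse. It is not the parser used in the CAGNET scripts: `str2bool` and `build_parser` are hypothetical names, and no defaults are set because this README does not state them. The helper exists because argparse's `type=bool` would turn the string "False" into `True`.

```python
import argparse

def str2bool(text):
    """Parse 'True'/'False' style strings into booleans."""
    lowered = text.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {text!r}")

def build_parser():
    """Hypothetical parser mirroring the flags listed in this README."""
    parser = argparse.ArgumentParser(description="CAGNET-style flags (illustrative)")
    # Integer-valued flags.
    for name in ("accperrank", "epochs", "midlayer", "runcount", "replication"):
        parser.add_argument(f"--{name}", type=int)
    # Dataset name; the choices come from the flag list above.
    parser.add_argument("--graphname", choices=["Reddit", "Amazon", "subgraph3"])
    # Boolean flags passed as True/False strings.
    for name in ("timing", "normalization", "activations", "accuracy", "download"):
        parser.add_argument(f"--{name}", type=str2bool)
    return parser

args = build_parser().parse_args(
    ["--graphname=Reddit", "--timing=False", "--midlayer=16", "--replication=2"]
)
```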

The Amazon and Protein (subgraph3) datasets must exist as COO files in ../data/<graphname>/processed/, serialized with pickle. For Reddit, PyG handles downloading and accessing the dataset (see below).
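As a rough illustration of what "a COO file serialized with pickle" can look like, the sketch below writes and reads back a small edge list. Everything specific here is an assumption made for the example: the file name `example_coo.pkl`, the dictionary layout, and the helper names `save_coo`, `load_coo`, and `roundtrip_demo` are hypothetical, and the layout the CAGNET scripts actually expect should be checked against their loading code.

```python
import os
import pickle
import tempfile

def save_coo(path, rows, cols, num_nodes):
    """Pickle an edge list in COO form (parallel row/column index lists)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        # Assumed layout for illustration only.
        pickle.dump({"row": rows, "col": cols, "num_nodes": num_nodes}, f)

def load_coo(path):
    """Load the pickled COO dictionary written by save_coo."""
    with open(path, "rb") as f:
        return pickle.load(f)

def roundtrip_demo():
    """Write a 3-node graph under <tmp>/data/Amazon/processed/ and read it back."""
    root = tempfile.mkdtemp()
    path = os.path.join(root, "data", "Amazon", "processed", "example_coo.pkl")
    save_coo(path, rows=[0, 1, 2], cols=[1, 2, 0], num_nodes=3)
    return load_coo(path)

graph = roundtrip_demo()
```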

Running on OLCF Summit (example)

To run the CAGNET 1.5D algorithm on Reddit with

16 processes
100 epochs
16 hidden layer activations
2-factor replication

first run the following command to download the Reddit dataset:

python gcn_distr_15d.py --graphname=Reddit --download=True

This downloads Reddit into ../data. Once the download finishes, run the following command to start training:

ddlrun -x WORLD_SIZE=16 -x MASTER_ADDR=$(echo $LSB_MCPU_HOSTS | cut -d " " -f 3) -x MASTER_PORT=1234 -accelerators 6 python gcn_distr_15d.py --accperrank=6 --epochs=100 --graphname=Reddit --timing=False --midlayer=16 --runcount=1 --replication=2
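The `MASTER_ADDR` expression in the command above takes the third space-separated field of `LSB_MCPU_HOSTS`. The sketch below mirrors that `cut -d " " -f 3` step in Python. It assumes the variable alternates host names and slot counts, with the launch node listed first, so that the third field is the first compute host; the host names in the example string are made up, and `master_addr` is a hypothetical helper, not part of CAGNET.

```python
def master_addr(lsb_mcpu_hosts):
    """Return the third space-separated field, like `cut -d " " -f 3`."""
    fields = lsb_mcpu_hosts.split(" ")
    return fields[2]

# Made-up value in the assumed "host slots host slots ..." layout.
example_hosts = "batch1 1 nodeA 42 nodeB 42 nodeC 42"
addr = master_addr(example_hosts)
```

If the scheduler on another system formats its host list differently, the field index would need to change accordingly.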

Citation

To cite CAGNET, please refer to:

Alok Tripathy, Katherine Yelick, Aydın Buluç. Reducing Communication in Graph Neural Network Training. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’20), 2020.
