GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design

Haoran You, Tong Geng, Yongan Zhang, Ang Li, Yingyan Lin (Also credit to Cheng Wan's help and discussion on graph paritioning)

Accepted by HPCA 2022. More Info: [ Paper | Slide | Youtube | Github ]

Overview of the Co-Design Framework

We propose a GCN algorithm and accelerator Co-Design framework dubbed GCoD.

On the algorithm level, GCoD integrates a split and conquer training strategy to polarize the graphs to be either denser or sparser in local neighborhoods without compromising the model accuracy, resulting in adjacency matrices that have two levels of workload and enjoys largely enhanced regularity and thus ease of acceleration.
On the hardware level, GCoD integrates a dedicated two-pronged accelerator to leverage GCoD algorithm's resulting graph adjacency matrices for further boosting the acceleration efficiency. Results of the two branches are then aggregated without conflicts.

Usage of the Provided Minimalistic Codebase

Prerequisite

conda install pytorch==1.7.0 torchvision torchaudio cudatoolkit=11.0 -c pytorch
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.7.0+cu110.html
pip install torch-geometric
pip install tqdm
pip install ogb
conda install -c dglteam dgl-cuda11.0

Pretrain GCNs on partitioned graphs

python train.py \
    --model GCN \
    --dataset Cora \
    --partition \
    --device cuda:0 \
    --save_prefix pretrain_partition \
    --quant \
    --enable_chunk_q \
    --num_act_bits 6 \
    --num_wei_bits 6 \
    --num_agg_bits 6

More examples are provided in ./scripts/cmd_pretrain.sh.

Supported models

GCN
GAT
GIN
GraphSAGE

Supported datasets

Cora
CiteSeer
Pubmed
NELL

For training on Reddit or larger graphs, please refer to DeeperGCN or BNS-GCN and adapt the code accordingly.

Tuning according to both sparse and polarization/diagonalization regularization terms

python tune.py \
    --model GCN \
    --dataset Cora \
    --hard \
    --device cuda:4 \
    --save_prefix graph_tune \
    --iteration 1 \
    --ratio_graph 10 \
    --quant \
    --num_bits 16

More examples are provided in ./scripts/cmd_tune.sh.

Visualization of the Resulting Adjacency Matrix

Visualization scripts are provided in ./scripts/cmd_plot.sh

Ideas in Hardware Architecture

Below figure illustrates the overall micro-architecture of the GCoD accelerator. For better processing elements (PEs) utilization and reduced off-chip memory access during the performance dominant aggregation phase, GCoD accelerator consists of two separate computing branches with each dedicated to process the denser workload and sparser workload of GCoD algorithm's resulting adjacency matrices, respectively.

Speedups over Other Platforms

Extensive experiments and ablation studies validate that our GCoD consistently reduces the number of off-chip accesses, leading to speedups of 15286x, 294x, 7.8x, and 2.5x as compared to CPUs, GPUs, and prior-art GCN accelerators including HyGCN and AWB-GCN, respectively

Citation

If you find this codebase useful to your research, please cite:

@inproceedings{you2021gcod,
  title={GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design},
  author={You, Haoran and Geng, Tong and Zhang, Yongan and Li, Ang and Lin, Yingyan},
  booktitle={The 28th IEEE International Symposium on High-Performance Computer Architecture (HPCA-28)},
  year={2022}
}