This repository contains scripts and instructions for reproducing the experiments in our EuroSys '22 paper "FGNN: A Factored System For Sample-based GNN Training Over GPUs".
FGNN (also called SamGraph) is a factored system for sample-based GNN training over GPUs, where each GPU is dedicated to either graph sampling or model training. Dedicating GPUs to one task eliminates GPU memory contention and thereby accelerates both tasks. Furthermore, FGNN embodies a new pre-sampling-based caching policy that takes both the sampling algorithm and the GNN dataset into account, achieving efficient and robust caching performance.
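The idea behind the pre-sampling-based caching policy can be illustrated with a short sketch. This is our own simplified illustration, not FGNN's actual API: the function names (`presample_hotness`, `pick_cache_nodes`) and the data layout are hypothetical. The policy runs the sampler for a few warm-up epochs, counts how often each vertex appears in sampled mini-batches, and caches the features of the hottest vertices on the trainer GPU.

```python
import numpy as np

def presample_hotness(sample_epoch, num_nodes, num_warmup_epochs=2):
    """Count how often each node appears in pre-sampled mini-batches."""
    hotness = np.zeros(num_nodes, dtype=np.int64)
    for _ in range(num_warmup_epochs):
        for batch_nodes in sample_epoch():  # one epoch of sampling only, no training
            np.add.at(hotness, batch_nodes, 1)
    return hotness

def pick_cache_nodes(hotness, cache_percentage):
    """Select the hottest nodes whose features fit in the cache budget."""
    num_cached = int(len(hotness) * cache_percentage)
    return np.argsort(hotness)[::-1][:num_cached]

# Toy usage: a fake sampler over 10 nodes in which node 3 is sampled most often.
def fake_sampler():
    yield np.array([3, 1, 2])
    yield np.array([3, 3, 5])

hot = presample_hotness(fake_sampler, num_nodes=10)
cached = pick_cache_nodes(hot, cache_percentage=0.2)  # cache the top 20% (2 nodes)
```

Because the hotness statistics come from the actual sampling algorithm running on the actual dataset, the policy adapts to both, which is what makes it robust across workloads.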
- Project Structure
- Paper's Hardware Configuration
- Installation
- Dataset Preprocessing
- QuickStart: Use FGNN to train GNN models
- Experiments
- License
- Academic and Conference Papers
## Project Structure

```
> tree .
├── datagen                   # Dataset preprocessing
├── example
│   ├── dgl
│   │   ├── multi_gpu         # DGL models
│   ├── pyg
│   │   ├── multi_gpu         # PyG models
│   ├── samgraph
│   │   ├── balance_switcher  # FGNN dynamic switch
│   │   ├── multi_gpu         # FGNN models
│   │   ├── sgnn              # SGNN models
│   │   ├── sgnn_dgl          # DGL PinSAGE models (SGNN-simulated)
├── exp                       # Experiment scripts
│   ├── figXX
│   ├── tableXX
├── samgraph                  # FGNN and SGNN source code
└── utility                   # Useful tools for dataset preprocessing
```
## Paper's Hardware Configuration

- 8 * NVIDIA V100 GPUs (16GB of memory each)
- 2 * Intel Xeon Platinum 8163 CPUs (24 cores each)
- 512GB RAM

Note: on the AE machine we provide, each V100 GPU has 32GB of memory.
## Installation

We have prepared an out-of-the-box environment (with all datasets preprocessed) for the AE reviewers. AE reviewers do not need to perform the following steps if they choose to run the experiments on the machine we provide. The machine's IP address and account information can be found in the AE appendix.

Software requirements:
- Ubuntu 18.04 or Ubuntu 20.04
- gcc-7, g++-7
- CMake >= 3.14
- CUDA v10.1
- Python v3.8
- PyTorch v1.7.1
- DGL V0.7.1
- PyG v2.0.1
FGNN is built on CUDA 10.1. Follow the instructions at https://developer.nvidia.com/cuda-10.1-download-archive-base to install CUDA 10.1, and make sure that `/usr/local/cuda` is linked to `/usr/local/cuda-10.1`.

CUDA 10.1 requires GCC version 7 or lower. Hence, make sure that `gcc` is linked to `gcc-7` and `g++` is linked to `g++-7`:
```bash
# Ubuntu
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 7
```
1. Create a conda environment. We use conda to manage our Python environment.

   ```bash
   conda create -n fgnn_env python==3.8 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch -y
   conda activate fgnn_env
   conda install cudnn numpy scipy networkx tqdm pandas ninja cmake -y  # system cmake is too old to build DGL
   sudo apt install gnuplot  # gnuplot is used by the experiment scripts
   ```

2. Download the GNN systems.

   ```bash
   # Download FGNN source code
   git clone --recursive https://github.com/SJTU-IPADS/fgnn-artifacts.git
   ```

3. Install DGL, PyG, and FastGraph. The FastGraph package is used to load datasets for the GNN systems in all experiments.

   ```bash
   # Install DGL
   ./fgnn-artifacts/3rdparty/dgl_install.sh
   # Install PyG
   ./fgnn-artifacts/3rdparty/pyg_install.sh
   # Install FastGraph
   ./fgnn-artifacts/utility/fg_install.sh
   ```

4. Install FGNN (also called SamGraph) and SGNN.

   ```bash
   cd fgnn-artifacts
   ./build.sh
   ```
Both DGL and FGNN use a large amount of system resources: DGL's CPU sampling requires cross-process communication, while FGNN's global queue requires memlocked (pinned) memory to enable faster memcpy between host memory and GPU memory. Hence the user limits have to be raised.
Append the following content to `/etc/security/limits.conf` and then reboot:

```
* soft nofile 65535         # for DGL CPU sampling
* hard nofile 65535         # for DGL CPU sampling
* soft memlock 200000000    # for FGNN global queue
* hard memlock 200000000    # for FGNN global queue
```
After the reboot you should see:

```bash
> ulimit -n
65535
> ulimit -l
200000000
```
## Dataset Preprocessing

AE reviewers do not need to perform the following steps if they choose to run the experiments on the machine we provide; we have already downloaded and preprocessed the datasets in `/graph-learning/samgraph`.

See `datagen/README.md` for instructions on preprocessing the datasets.
## QuickStart: Use FGNN to train GNN models

FGNN is compiled into a Python library. We have written several GNN models using FGNN's APIs. These models are in `fgnn-artifacts/example` and are easy to run, as follows:

```bash
cd fgnn-artifacts/example
python samgraph/multi_gpu/train_gcn.py --dataset papers100M --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000
```
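The `--pipeline` flag overlaps GPU sampling with GPU training. Conceptually this is a producer/consumer pipeline over a bounded global queue. The following pure-Python sketch is our own illustration (not FGNN code), with threads standing in for the sampler and trainer GPUs:

```python
import queue
import threading

NUM_BATCHES = 8
global_queue = queue.Queue(maxsize=2)  # bounded: the sampler cannot run far ahead

def sampler():
    # Stands in for a sampler GPU: produce mini-batches into the global queue.
    for step in range(NUM_BATCHES):
        global_queue.put(f"batch-{step}")
    global_queue.put(None)  # sentinel: epoch finished

trained = []

def trainer():
    # Stands in for a trainer GPU: consume batches as soon as they are ready.
    while True:
        batch = global_queue.get()
        if batch is None:
            break
        trained.append(batch)

t1 = threading.Thread(target=sampler)
t2 = threading.Thread(target=trainer)
t1.start(); t2.start()
t1.join(); t2.join()
```

In FGNN the queue entries are buffers in pinned host memory (hence the `memlock` limit above), so transfers between sampler GPUs and trainer GPUs can use fast asynchronous memcpy.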
## Experiments

Our experiments are automated by scripts (`run.py`). Each figure or table in our paper is treated as one experiment and is associated with a subdirectory in `fgnn-artifacts/exp`. The scripts automatically run each experiment, save the logs to files, and parse the output data from those files.

Note that running all experiments may take several hours; see `exp/README.md` for the expected running time of each experiment and for instructions on conducting them.
## License

FGNN is released under the Apache License 2.0.
## Academic and Conference Papers

[EuroSys] **FGNN: A Factored System for Sample-based GNN Training over GPUs.** Jianbang Yang, Dahai Tang, Xiaoniu Song, Lei Wang, Qiang Yin, Rong Chen, Wenyuan Yu, Jingren Zhou. Proceedings of the 17th European Conference on Computer Systems, Rennes, France, April 2022.