This repository contains scripts and instructions for reproducing the experiments in our EuroSys '22 paper "FGNN: A Factored System For Sample-based GNN Training Over GPUs".
FGNN (also called SamGraph) is a factored system for sample-based GNN training over GPUs, where each GPU is dedicated to either graph sampling or model training. Dedicating GPUs to one task eliminates GPU memory contention and thereby accelerates both tasks. Furthermore, FGNN embodies a new pre-sampling-based caching policy that takes both the sampling algorithm and the GNN dataset into account, achieving efficient and robust caching performance.
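The idea behind the pre-sampling-based caching policy can be illustrated with a short sketch. This is our own simplified illustration, not FGNN's actual API: the function names (`presample_hotness`, `pick_cache_nodes`) and the data layout are hypothetical. The policy runs the sampler for a few warm-up epochs, counts how often each vertex appears in sampled mini-batches, and caches the features of the hottest vertices on the trainer GPU.

```python
import numpy as np

def presample_hotness(sample_epoch, num_nodes, num_warmup_epochs=2):
    """Count how often each node appears in pre-sampled mini-batches."""
    hotness = np.zeros(num_nodes, dtype=np.int64)
    for _ in range(num_warmup_epochs):
        for batch_nodes in sample_epoch():  # one epoch of sampling only, no training
            np.add.at(hotness, batch_nodes, 1)
    return hotness

def pick_cache_nodes(hotness, cache_percentage):
    """Select the hottest nodes whose features fit in the cache budget."""
    num_cached = int(len(hotness) * cache_percentage)
    return np.argsort(hotness)[::-1][:num_cached]

# Toy usage: a fake sampler over 10 nodes in which node 3 is sampled most often.
def fake_sampler():
    yield np.array([3, 1, 2])
    yield np.array([3, 3, 5])

hot = presample_hotness(fake_sampler, num_nodes=10)
cached = pick_cache_nodes(hot, cache_percentage=0.2)  # cache the top 20% (2 nodes)
```

Because the hotness statistics come from the actual sampling algorithm running on the actual dataset, the policy adapts to both, which is what makes it robust across workloads.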
- Project Structure
- Paper's Hardware Configuration
- Installation
- Dataset Preprocessing
- QuickStart: Use FGNN to train GNN models
- Experiments
- License
- Academic and Conference Papers
## Project Structure

```
> tree .
├── datagen                   # Dataset preprocessing
├── example
│   ├── dgl
│   │   ├── multi_gpu         # DGL models
│   ├── pyg
│   │   ├── multi_gpu         # PyG models
│   ├── samgraph
│   │   ├── balance_switcher  # FGNN dynamic switch
│   │   ├── multi_gpu         # FGNN models
│   │   ├── sgnn              # SGNN models
│   │   ├── sgnn_dgl          # DGL PinSAGE models (SGNN-simulated)
├── exp                       # Experiment scripts
│   ├── figXX
│   ├── tableXX
├── samgraph                  # FGNN and SGNN source code
└── utility                   # Useful tools for dataset preprocessing
```
## Paper's Hardware Configuration

- 8 * NVIDIA V100 GPUs (16GB of memory each)
- 2 * Intel Xeon Platinum 8163 CPUs (24 cores each)
- 512GB RAM

Note: on the AE machine we provide, each V100 GPU has 32GB of memory.
## Installation

We have prepared an out-of-the-box environment (with all datasets preprocessed) for the AE reviewers. AE reviewers do not need to perform the following steps if they choose to run the experiments on the machine we provide. The machine's IP address and account information can be found in the AE appendix.

Software requirements:
- Ubuntu 18.04 or Ubuntu 20.04
- gcc-7, g++-7
- CMake >= 3.14
- CUDA v10.1
- Python v3.8
- PyTorch v1.7.1
- DGL V0.7.1
- PyG v2.0.1
FGNN is built on CUDA 10.1. Follow the instructions at https://developer.nvidia.com/cuda-10.1-download-archive-base to install CUDA 10.1, and make sure that `/usr/local/cuda` is linked to `/usr/local/cuda-10.1`.

CUDA 10.1 requires GCC version 7 or lower. Hence, make sure that `gcc` is linked to `gcc-7` and `g++` is linked to `g++-7`:
```bash
# Ubuntu
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 7
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 7
```
1. Create a conda environment. We use conda to manage our Python environment.

   ```bash
   conda create -n fgnn_env python==3.8 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch -y
   conda activate fgnn_env
   conda install cudnn numpy scipy networkx tqdm pandas ninja cmake -y  # system cmake is too old to build DGL
   sudo apt install gnuplot  # gnuplot is used by the experiment scripts
   ```

2. Download the GNN systems.

   ```bash
   # Download FGNN source code
   git clone --recursive https://github.com/SJTU-IPADS/fgnn-artifacts.git
   ```

3. Install DGL, PyG, and FastGraph. The FastGraph package is used to load datasets for the GNN systems in all experiments.

   ```bash
   # Install DGL
   ./fgnn-artifacts/3rdparty/dgl_install.sh
   # Install PyG
   ./fgnn-artifacts/3rdparty/pyg_install.sh
   # Install FastGraph
   ./fgnn-artifacts/utility/fg_install.sh
   ```

4. Install FGNN (also called SamGraph) and SGNN.

   ```bash
   cd fgnn-artifacts
   ./build.sh
   ```
Both DGL and FGNN use a large amount of system resources: DGL's CPU sampling requires cross-process communication, while FGNN's global queue requires memlocked (pinned) memory to enable faster memcpy between host memory and GPU memory. Hence the user limits have to be raised.
Append the following content to `/etc/security/limits.conf` and then reboot:

```
* soft nofile 65535         # for DGL CPU sampling
* hard nofile 65535         # for DGL CPU sampling
* soft memlock 200000000    # for FGNN global queue
* hard memlock 200000000    # for FGNN global queue
```
After the reboot you should see:

```bash
> ulimit -n
65535
> ulimit -l
200000000
```
## Dataset Preprocessing

AE reviewers do not need to perform the following steps if they choose to run the experiments on the machine we provide; we have already downloaded and preprocessed the datasets in `/graph-learning/samgraph`.

See `datagen/README.md` for instructions on preprocessing the datasets.
## QuickStart: Use FGNN to train GNN models

FGNN is compiled into a Python library. We have written several GNN models using FGNN's APIs. These models are in `fgnn-artifacts/example` and are easy to run, as follows:

```bash
cd fgnn-artifacts/example
python samgraph/multi_gpu/train_gcn.py --dataset papers100M --num-train-worker 1 --num-sample-worker 1 --pipeline --cache-policy pre_sample --cache-percentage 0.1 --num-epoch 10 --batch-size 8000
```
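The `--pipeline` flag overlaps GPU sampling with GPU training. Conceptually this is a producer/consumer pipeline over a bounded global queue. The following pure-Python sketch is our own illustration (not FGNN code), with threads standing in for the sampler and trainer GPUs:

```python
import queue
import threading

NUM_BATCHES = 8
global_queue = queue.Queue(maxsize=2)  # bounded: the sampler cannot run far ahead

def sampler():
    # Stands in for a sampler GPU: produce mini-batches into the global queue.
    for step in range(NUM_BATCHES):
        global_queue.put(f"batch-{step}")
    global_queue.put(None)  # sentinel: epoch finished

trained = []

def trainer():
    # Stands in for a trainer GPU: consume batches as soon as they are ready.
    while True:
        batch = global_queue.get()
        if batch is None:
            break
        trained.append(batch)

t1 = threading.Thread(target=sampler)
t2 = threading.Thread(target=trainer)
t1.start(); t2.start()
t1.join(); t2.join()
```

In FGNN the queue entries are buffers in pinned host memory (hence the `memlock` limit above), so transfers between sampler GPUs and trainer GPUs can use fast asynchronous memcpy.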
## Experiments

Our experiments are automated by scripts (`run.py`). Each figure or table in our paper is treated as one experiment and is associated with a subdirectory in `fgnn-artifacts/exp`. The scripts automatically run each experiment, save the logs to files, and parse the output data from those files.

Note that running all experiments may take several hours; see `exp/README.md` for the expected running time of each experiment and for instructions on conducting them.
## License

FGNN is released under the Apache License 2.0.
## Academic and Conference Papers

[EuroSys] **FGNN: A Factored System for Sample-based GNN Training over GPUs.** Jianbang Yang, Dahai Tang, Xiaoniu Song, Lei Wang, Qiang Yin, Rong Chen, Wenyuan Yu, Jingren Zhou. Proceedings of the 17th European Conference on Computer Systems, Rennes, France, April 2022.