Ginex

Ginex is a GNN training system that efficiently trains on billion-scale datasets on a single machine by using an SSD as a memory extension. Ginex accelerates the entire training procedure through provably optimal in-memory caching of the feature vectors that reside on the SSD, without any negative impact on training quality.

Please refer to the full paper here.
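
The classical provably optimal replacement policy for a cache, when the future access sequence is known, is Belady's algorithm: on a miss with a full cache, evict the entry whose next use lies furthest in the future. The sketch below illustrates that idea on a toy access trace and compares it with LRU. It is not Ginex's implementation; the function names and the trace are made up for illustration.

```python
from collections import OrderedDict
from typing import Dict, Hashable, List, Sequence

# Hypothetical access trace of feature-vector IDs (illustration only).
TOY_TRACE = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]


def belady_misses(trace: Sequence[Hashable], capacity: int) -> int:
    """Count cache misses under Belady's offline-optimal policy."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    never = float("inf")
    # next_use[i] = position at which trace[i] is accessed next (inf if never).
    next_use: List[float] = [never] * len(trace)
    upcoming: Dict[Hashable, int] = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = upcoming.get(trace[i], never)
        upcoming[trace[i]] = i

    cache: Dict[Hashable, float] = {}  # key -> position of its next access
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the entry that will be needed furthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[key] = next_use[i]
    return misses


def lru_misses(trace: Sequence[Hashable], capacity: int) -> int:
    """Count cache misses under least-recently-used replacement."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache: "OrderedDict[Hashable, None]" = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used key
            cache[key] = None
    return misses
```

With a capacity of 3, `belady_misses(TOY_TRACE, 3)` gives 5 misses and `lru_misses(TOY_TRACE, 3)` gives 6. Belady's policy needs the future accesses in advance, which is why it is usually treated as an offline bound rather than a practical online policy.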

Installation and Running a Toy Example

Follow the instructions below to install the requirements and run a toy example using the ogbn_papers100M dataset.

Basic Settings

Disable read_ahead. Replace $block_device_name with the name of the block device in question.

sudo -s
echo 0 > /sys/block/$block_device_name/queue/read_ahead_kb
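
To confirm that the setting took effect, the value can be read back from sysfs. The following is a minimal Python sketch, assuming the standard `/sys/block/<device>/queue/read_ahead_kb` layout used by the command above. The helper names and the `sys_block` parameter are illustrative, and `nvme0n1` in the usage comment is a hypothetical device name.

```python
from pathlib import Path


def read_ahead_kb(device: str, sys_block: str = "/sys/block") -> int:
    """Return the read-ahead window (in KB) configured for a block device."""
    path = Path(sys_block) / device / "queue" / "read_ahead_kb"
    return int(path.read_text().strip())


def read_ahead_disabled(device: str, sys_block: str = "/sys/block") -> bool:
    """True if read-ahead is turned off (window of 0 KB) for the device."""
    return read_ahead_kb(device, sys_block) == 0


# Usage (hypothetical device name):
#     read_ahead_disabled("nvme0n1")
```

Values written to sysfs do not persist across reboots, so the check is worth repeating after the machine restarts.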

Install necessary Linux packages.
1. sudo apt-get install -y build-essential
2. sudo apt-get install -y cgroup-tools
3. sudo apt-get install -y unzip
4. sudo apt-get install -y python3-pip, followed by pip3 install --upgrade pip
5. Compatible NVIDIA CUDA driver and toolkit. Visit NVIDIA CUDA Installation Guide for Linux for details.

Install necessary Python modules.

1. PyTorch, version 1.9.0 or later. Visit here for details.
2. pip3 install tqdm
3. pip3 install ogb
4. PyG. Visit here for details.
5. Ninja, installed with the commands below.

sudo wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
sudo unzip ninja-linux.zip -d /usr/local/bin/
sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force

Use cgroup to mimic the setting assumed in the paper, where the dataset is much larger than the main memory, with the ogbn_papers100M dataset. We recommend limiting the memory size to 8GB.
```
sudo -s
cgcreate -g memory:8gb
echo 8000000000 > /sys/fs/cgroup/memory/8gb/memory.limit_in_bytes
```
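
The limit can be read back to check that the cgroup was created as intended. The sketch below assumes the cgroup v1 path used in the commands above; on systems that use cgroup v2 the memory limit is exposed through a `memory.max` file instead, so the path would need adjusting. The helper names and the `root` parameter are illustrative.

```python
from pathlib import Path


def cgroup_memory_limit(group: str, root: str = "/sys/fs/cgroup/memory") -> int:
    """Return the memory limit (in bytes) of a cgroup v1 memory group."""
    path = Path(root) / group / "memory.limit_in_bytes"
    return int(path.read_text().strip())


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (GiB, 2**30 bytes)."""
    return num_bytes / 2**30


# Usage, with the group name from the commands above:
#     limit = cgroup_memory_limit("8gb")
#     print(limit, f"{bytes_to_gib(limit):.2f} GiB")
```

Note that the limit of 8000000000 bytes is 8 GB in decimal units, which is roughly 7.45 GiB.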

Make sure to allocate enough swap space. We recommend allocating at least 4GB of swap space.

sudo fallocate -l 4G swap.img
sudo chmod 600 swap.img
sudo mkswap swap.img
sudo swapon swap.img
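
Whether the swap space is active can be checked from the `SwapTotal` field of `/proc/meminfo`. The sketch below parses that field from the file's text. The function names are illustrative, and the 4 * 10**9 byte threshold is an assumption about what "at least 4GB" means here.

```python
def swap_total_bytes(meminfo_text: str) -> int:
    """Extract SwapTotal from the contents of /proc/meminfo, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The field is reported in kB, where 1 kB is 1024 bytes.
            return int(line.split()[1]) * 1024
    raise ValueError("SwapTotal not found in meminfo text")


def has_enough_swap(meminfo_text: str, minimum_bytes: int = 4 * 10**9) -> bool:
    """True if the total swap space is at least `minimum_bytes`."""
    return swap_total_bytes(meminfo_text) >= minimum_bytes


# Usage on a Linux machine:
#     with open("/proc/meminfo") as f:
#         print(has_enough_swap(f.read()))
```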

Running a toy example

Clone our library

git clone https://github.com/SNU-ARC/Ginex.git

Prepare dataset
```
python3 prepare_dataset.py
```

Preprocess (Neighbor cache construction)

python3 create_neigh_cache.py --neigh-cache-size 6000000000

Get PYTHONPATH
```
python3 get_pythonpath.py
```
Run the baseline, i.e., PyG extended to support disk-based processing of graph datasets (denoted as PyG+ in the paper). Replace the PYTHONPATH=... value with the output of get_pythonpath.py from the previous step. The -W ignore option suppresses Python warnings.
```
sudo PYTHONPATH=/home/user/.local/lib/python3.8/site-packages cgexec -g memory:8gb python3 -W ignore run_baseline.py
```

Run Ginex. Replace the PYTHONPATH=... value with the output of get_pythonpath.py from the Get PYTHONPATH step. The -W ignore option suppresses Python warnings.

sudo PYTHONPATH=/home/user/.local/lib/python3.8/site-packages cgexec -g memory:8gb python3 -W ignore run_ginex.py --neigh-cache-size 6000000000 --feature-cache-size 6000000000 --sb-size 1500

Results

The following is the result of the toy example on our local server.

Environment

CPU: Intel Xeon Gold 6244 CPU 8-core (16 logical cores with hyper-threading) @ 3.60GHz
GPU: NVIDIA Tesla V100 16GB PCIe
Memory: Samsung DDR4-2666 64GB (32GB X 2) (cgroup of 8GB is used)
Storage: Samsung PM1725b 8TB PCIe Gen3 8-lane
S/W: Ubuntu 18.04.5 & CUDA 11.4 & Python 3.6.9 & PyTorch 1.9

Baseline

Per epoch training time: 216.1687 sec

Ginex

Per epoch training time: 99.5562 sec (Speedup of 2.2x)
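
The reported speedup follows directly from the two per-epoch times above. A short sketch of the arithmetic (the function name is illustrative):

```python
def speedup(baseline_sec: float, optimized_sec: float) -> float:
    """Ratio of baseline time to optimized time."""
    if optimized_sec <= 0:
        raise ValueError("optimized_sec must be positive")
    return baseline_sec / optimized_sec


# Per-epoch training times reported above.
BASELINE_SEC = 216.1687
GINEX_SEC = 99.5562

print(f"Speedup: {speedup(BASELINE_SEC, GINEX_SEC):.1f}x")
```

The ratio is about 2.17, which rounds to the 2.2x quoted above.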

Maintainer

Yeonhong Park (parkyh96@gmail.com)

Sunhong Min (sunhongmin@snu.ac.kr)

Citation

Please cite our paper if you find it useful for your work:

@inproceedings{park2022vldb,
 author    = {Yeonhong Park and Sunhong Min and Jae W. Lee},
 title     = {Ginex: SSD-enabled Billion-scale Graph Neural Network Training on a Single Machine via Provably Optimal In-memory Caching},
 booktitle = {Proceedings of the VLDB Endowment},
 volume    = {15},
 number    = {11},
 year      = {2022}
}