MariusGNN (called SystemX during the paper's double-blind submission) is a system for resource-efficient training of graph neural networks (GNNs) over large-scale graphs on a single machine. To support such training, MariusGNN uses two main techniques, as described in our EuroSys '23 paper:
- Optimized data structures for neighborhood sampling and GNN aggregation
- Partition replacement policies for out-of-core GNN training
This branch contains the artifact used to produce the experiments reported in the paper. Note that Marius/MariusGNN are under active development; the main branch has changed and will continue to diverge from the code supplied here. In particular, the configuration file format has changed, so the configs used here will not run directly using the main branch (or vice versa). For the most up-to-date Marius/MariusGNN, see the main branch of the repository.
This artifact is licensed under the Apache 2.0 License as described in the LICENSE file.
The following lists the dependencies and versions used for this artifact and how to build this version of MariusGNN. We highly recommend using Docker for installation as described below.
- Ubuntu 18.04
- CUDA 11.4
- CuDNN 8
- PyTorch 1.9.1
- Python 3.6.9
- Pip 9.0.1
- GCC 9
- CMake 3.20
- Make 4.1
- DGL 0.7.0
- PyTorch Geometric (PyG) 2.0.3
- dstat
Note that PyTorch/DGL/PyG need to be installed with CUDA support for GPU training. See `examples/docker/Dockerfile` for installation commands.
Strictly speaking, this artifact requires GPU support. However, we provide a minimal working example which runs MariusGNN and the baseline systems on the CPU for those who do not have a GPU machine. CPU-only installation does not require CUDA/CuDNN or CUDA builds of PyTorch/DGL/PyG. Even for CPU-only support we recommend the Docker installation described below.
The following Docker installation installs the necessary dependencies for this artifact and builds MariusGNN. It requires Docker to be installed (Docker can generally be installed easily using your favorite package manager), as well as the NVIDIA drivers for GPU support (check by running `nvidia-smi` and verifying that it can detect the GPUs).
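Before building, a quick sanity check can confirm the prerequisites. The following is a sketch; the CUDA image tag is an assumption and may need to match your installed driver/CUDA version:

```bash
# Check that Docker is installed and that the host driver sees the GPUs.
docker --version
nvidia-smi

# Optionally check that Docker can pass the GPUs through to containers
# (requires the NVIDIA container toolkit; the image tag is illustrative).
docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu18.04 nvidia-smi
```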
Notes:
- The installation requires Docker to have at least 8GB of memory to work with. This is generally satisfied by default, but if not (often on Mac), the `docker build` command may fail with error code 137. Several StackOverflow threads describe how to increase the memory available to Docker and fix this issue. The `pip3 install .` command may also cause Docker memory issues. Increase the memory available to Docker or decrease the number of threads used for building MariusGNN (to decrease the number of threads, change `-j16` in line 42 of `setup.py` to `-j1`, for example; see the sketch after this list). One thread should build with 8GB of memory but may take some time (~30mins).
- For the experiments, Docker should have access to the full available machine memory. Artifact minimal working examples should run with 8GB.
- For the `docker run` command below, if you have created, for example, `~/directory/marius_artifact`, then pass `~/directory/` as the `<path to parent directory of marius_artifact>`.
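For reference, a minimal sketch of the build-thread change mentioned in the first note, assuming `-j16` appears verbatim in `setup.py`:

```bash
# Reduce build parallelism from 16 threads to 1 to lower peak memory usage
# (a single-threaded build fits in 8GB but may take ~30mins).
sed -i 's/-j16/-j1/' setup.py
```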
CPU Only Installation: If your machine does not have a GPU, remove the `--gpus all` flag from the `docker run` command in the GPU installation instructions.
GPU Installation:

```bash
git clone https://github.com/marius-team/marius.git marius_artifact
cd marius_artifact
git checkout eurosys_2023_artifact
cd examples/docker/
docker build -t marius:artifact .
docker run --gpus all -it --ipc=host -v <path to parent directory of marius_artifact>:/working_dir/ marius:artifact bash
cd marius_artifact
pip3 install .
python3 experiment_manager/run_experiment.py --experiment setup_dgl
sed -i 's/device=eth/device=th/g' /usr/local/lib/python3.6/dist-packages/dgl/optim/pytorch/sparse_optim.py
```
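After `pip3 install .` completes inside the container, the following quick smoke test (a sketch) confirms that the command line tools were installed and that PyTorch can see the GPU:

```bash
# The MariusGNN command line tools should be on the PATH.
which marius_preprocess marius_train

# For the GPU installation, this should print True.
python3 -c "import torch; print(torch.cuda.is_available())"
```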
Assuming installation of the above dependencies, this artifact can also be installed directly without Docker:

```bash
git clone https://github.com/marius-team/marius.git marius_artifact
cd marius_artifact
git checkout eurosys_2023_artifact
pip3 install .
python3 experiment_manager/run_experiment.py --experiment setup_dgl
sed -i 's/device=eth/device=th/g' /usr/local/lib/python3.6/dist-packages/dgl/optim/pytorch/sparse_optim.py
```
The latter two commands in both of the above installation instructions fix a typo in DGL's `sparse_optim.py` file. Depending on your DGL installation, this file may be in a different location.
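If the `sed` path above does not match your system, one way to locate the installed DGL package (a sketch using standard Python introspection) is:

```bash
# Find the directory of the installed DGL package and apply the same fix there.
DGL_DIR=$(python3 -c "import dgl, os; print(os.path.dirname(dgl.__file__))")
sed -i 's/device=eth/device=th/g' "$DGL_DIR/optim/pytorch/sparse_optim.py"
```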
This artifact installation installs the following command line tools:
- `marius_preprocess`: Built-in dataset downloading and preprocessing
- `marius_train`: Train GNN models using configuration files and the command line
MariusGNN can preprocess graphs and train GNNs from the command line by supplying a configuration file which defines the training procedure.
For example, the following command will preprocess the FB15k-237 graph:

```bash
marius_preprocess --dataset fb15k_237 datasets/fb15k237/
```
Given the preprocessed graph, we can then run MariusGNN according to a configuration file. For example:

```bash
# CPU:
marius_train experiment_manager/example/configs/marius_gs_mem_cpu.yaml
# GPU:
marius_train experiment_manager/example/configs/marius_gs_mem_gpu.yaml
```
However, instead of directly running the MariusGNN executable on configuration files, for this artifact we provide Python scripts which handle running each experiment (MariusGNN and the DGL/PyG baselines). These Python scripts are contained in the `experiment_manager` directory and are described in the following sections. They automatically pass the corresponding configuration files for each experiment to the corresponding system.
We provide a minimal working example which trains a GraphSage GNN on the (small) FB15k-237 graph for the task of link prediction using MariusGNN and our two baselines, DGL and PyG. MariusGNN training contains two configurations: one with the graph stored in memory (GPU or CPU memory depending on GPU support) and one with the graph data stored on disk (as an example of MariusGNN's out-of-core support). DGL and PyG only support in-memory training. Note that we use the small FB15k-237 graph to ensure this example is fast to run and does not require specialized hardware; the paper experiments report results over significantly larger graphs.
For CPU-based training, the minimal working example can be run as follows:

```bash
python3 experiment_manager/run_experiment.py --experiment fb15k237_mem_cpu --show_output
python3 experiment_manager/run_experiment.py --experiment fb15k237_disk_cpu --show_output
```
For GPU-based training, the minimal working example can be run as follows:

```bash
python3 experiment_manager/run_experiment.py --experiment fb15k237_mem_gpu --show_output
python3 experiment_manager/run_experiment.py --experiment fb15k237_disk_gpu --show_output
```
In both cases, the first script will run MariusGNN, DGL, and PyG with graph data stored in memory (either CPU or GPU). The output should contain the results of running each system for five epochs. Epoch runtimes and MRR (a measure of accuracy for link prediction) are reported for all systems. A summary of the five epochs for each system is presented at the end. The second script runs MariusGNN with graph data stored on disk (again for five epochs). Disk-based training support is a key property of MariusGNN which allows for up to 64x cheaper GNN training over large-scale graphs compared to DGL and PyG.
MariusGNN MRR on FB15k-237 for in-memory training should be roughly between 0.27-0.28 (as shown in Table 7 for GS 237). Disk-based MRR should be roughly between 0.26-0.27 (as shown for COMET in Table 7 for GS 237). In general, variance of 0.01 MRR is to be expected and MRRs in Table 7 will be slightly higher (as those experiments were run for ten epochs instead of five).
We also provide a minimal working example for the task of node classification. We train a three-layer GraphSage GNN on the ogbn-arxiv graph. The format for running the experiments is the same as above, except the `fb15k237` in the experiment name is replaced by `arxiv`:
```bash
python3 experiment_manager/run_experiment.py --experiment arxiv_mem_cpu --show_output
python3 experiment_manager/run_experiment.py --experiment arxiv_disk_cpu --show_output
python3 experiment_manager/run_experiment.py --experiment arxiv_mem_gpu --show_output
python3 experiment_manager/run_experiment.py --experiment arxiv_disk_gpu --show_output
```
MariusGNN accuracy on ogbn-arxiv for in-memory training should be roughly 66-67%. Disk-based accuracy should be roughly 66%. In general, variance of 1-2% is to be expected.
Notes:
- Minimal working example experiments can be run multiple times by passing `--num_runs <X>` to the Python script, or by overwriting existing runs by passing `--overwrite` as described below (see the example after this list).
- PyG CPU-only training on ogbn-arxiv can be considerably slower than the other experiments, taking ~10mins per epoch.
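For example, to run the in-memory GPU example three times, overwriting any previous results:

```bash
python3 experiment_manager/run_experiment.py --experiment fb15k237_mem_gpu \
    --num_runs 3 --overwrite --show_output
```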
Here we provide documentation on using and running the experiment manager. For documentation regarding the MariusGNN source code contained in this artifact, see the DOCS.md file.
This artifact is organized as follows:
```
datasets/                   // directory created to download and store preprocessed datasets
examples/docker/            // contains the example Dockerfile for installation
experiment_manager/         // suite of Python scripts for running paper experiments and baseline systems
  baselines/
    dgl/                    // contains the code for running GNNs using the DGL baseline
    pyg/                    // contains the code for running GNNs using the PyG baseline
  disk/                     // contains Python scripts and configs for out-of-core microbenchmarks (Table 7, Figure 8)
  example/                  // contains the Python scripts and configs for the minimal working example
  sampling/                 // contains the Python scripts and configs for neighborhood sampling experiments (Table 6)
  setup_dgl/                // contains a single experiment to import DGL and allow for typo fixing (used during install)
  system_comparisons/       // contains the Python scripts and configs for the system comparisons (Tables 3-5, Figure 7)
  executor.py               // executes training for each system and starts/stops results collection
  parsing.py                // parses the output of each system, dstat, and nvidia-smi
  reporting.py              // prints experiment summaries
  run_experiment.py         // entry script which runs each experiment with desired flags
  tracing.py                // runs dstat and nvidia-smi during experiments for tracing
results/                    // directory created to store experiment results and output files
src/                        // MariusGNN artifact source code
```
To reproduce the experiments we have provided Python scripts to run each experiment in the paper. Experiments are run from the repository root directory with:
```bash
python3 experiment_manager/run_experiment.py --experiment <experiment_name>
```

To change which experiment is run, simply change the `<experiment_name>`. Below we provide a table with the experiment name for each result reported in the paper, as well as the machine needed and any additional notes. Note that re-running the experiments from the paper can take many hours and requires access to AWS P3 GPU machines, leading to significant monetary cost. We report the estimated cost to reproduce each experiment in the table below.
Experiments cannot be run in parallel on the same machine and must be run one at a time, because MariusGNN uses the same paths to store intermediate program data across experiments.
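Given this constraint, a simple way to queue several experiments (a sketch; substitute experiment names from the table below) is a sequential shell loop:

```bash
# Run experiments strictly one at a time on the same machine.
for exp in papers100m_disk_time freebase86m_gs_disk_time; do
    python3 experiment_manager/run_experiment.py --experiment "$exp"
done
```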
By running experiments with the above command, results will be output to the terminal and to the corresponding `results/` directory for the experiment. See the artifact structure section (above) for the locations of the experiment directories.
The following are additional flags for the `experiment_manager/run_experiment.py` script (in addition to `--experiment`):
- `--overwrite`: Overwrites previous experiment results. Can be used in case the experiment results get into an inconsistent state.
- `--show_output`: By default, the output of each program is redirected to a file and stored in the experiment `results/` directory. This flag shows the output on stdout AND redirects it to the file. Useful for monitoring the experiment, but may print a lot of information.
- `--num_runs`: Number of times to run each experiment.
- `--enable_dstat`: Runs dstat tracing for the experiment.
- `--enable_nvidia_smi`: Runs nvidia-smi tracing for the experiment.
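For example, to run an experiment with both hardware traces enabled:

```bash
python3 experiment_manager/run_experiment.py --experiment arxiv_mem_gpu \
    --enable_dstat --enable_nvidia_smi
```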
In this section we include a list of experiments to reproduce the experimental results reported in the paper. All experiments are run using the experiment manager as described above; that is, the experiment name is provided to the `run_experiment.py` script with any additional desired arguments.
| Experiment Name | Expected Machine | Paper Table Ref. | A.4.2 Major Claim | Estimated Cost (One Run) | Short Explanation | Additional Notes |
|---|---|---|---|---|---|---|
| `papers100m` | P3.8xLarge | Table 3 | C1 | 4 hours; $50 | Papers100M epoch time and accuracy for all three systems with graph data stored in CPU memory | Table 3 reports three-run average |
| `papers100m_disk_acc` | P3.8xLarge | Table 3 | C1 | 4 hours; $50 | Papers100M disk-based training accuracy for MariusGNN | See disk-based training note below; Table 3 reports three-run average |
| `papers100m_disk_time` | P3.2xLarge | Table 3 | C1 | <1 hour; <$1 | Papers100M disk-based training epoch time for MariusGNN | See disk-based training note below; Table 3 reports three-run average |
| `freebase86m_gs` | P3.8xLarge | Table 4 | C2 | 30 hours; $350 | Freebase86M epoch time and accuracy for all three systems with graph data stored in CPU memory | - |
| `freebase86m_gs_disk_acc` | P3.8xLarge | Table 4 | C2 | 4 hours; $50 | Freebase86M disk-based training accuracy for MariusGNN | See disk-based training note below |
| `freebase86m_gs_disk_time` | P3.2xLarge | Table 4 | C2 | 3 hours; $10 | Freebase86M disk-based training epoch time for MariusGNN | See disk-based training note below |
| `training_trace` | P3.8xLarge | Table 6 | C3 | 6 hours; $75 | Breakdown of the timing of operations during in-memory training on Papers100M for MariusGNN, DGL, and PyG | See sampling note below |
| `freebase86m_beta_battles` | P3.8xLarge | Table 7 | C4 | 37 hours; $450 | Freebase86M results in Table 7 for in-memory training, COMET, and BETA using DistMult, GraphSage, and GAT models | See disk-based training microbenchmark note below |
Notes:
- Disk-based training: For disk-based training system comparisons, we report runtime using the smaller P3.2xLarge machine, which does not have enough memory to store the full graphs in CPU memory. However, to measure accuracy/MRR, we compute evaluation metrics using the full graph in main memory (for an apples-to-apples comparison to in-memory training). To do this, we run the exact same training setup on a larger machine and, in between each epoch (during evaluation only), load the full graph into main memory to compute accuracy/MRR. Alternatively, one could train on the P3.2xLarge machine without evaluation and then export the final embeddings to a larger machine for full graph evaluation (although this would prevent access to the per-epoch validation set metrics).
- Disk-based training microbenchmarks: Unlike the system comparisons, for simplicity the disk-based training microbenchmarks (e.g., Table 7 and Figure 8) do not use a separate machine for measuring accuracy and throughput. Instead we use a single machine with sufficient memory for full graph evaluation, but during disk-based training using COMET or BETA, the full graph is loaded into memory only during evaluation. Training proceeds using the partition buffer and partition replacement policy, and only a fraction of the graph is in memory at a given time. Using a single machine reduces the number of experiments and machines that need to be managed. Further, while the throughput numbers for COMET/BETA reported by this method may not match the throughput these methods would achieve on a machine without sufficient memory to store the full graph (e.g., a P3.2xLarge), they are sufficient for comparing the two methods (as the throughput numbers for both COMET and BETA were generated using the same hardware).
- Sampling: In Table 6 we report CPU sampling time as the total time required to sample multi-hop neighborhoods. This includes 1) identifying the multi-hop neighborhood and then 2) loading the features for the unique nodes in the neighborhood into the mini batch (to prepare the mini batch for transfer to the GPU). The `training_trace` experiment attempts to measure these two separately and outputs the results under the names "sampling" and "loading"; however, this separation can only be done for MariusGNN (due to the dataloaders in DGL and PyG). Thus, in Table 6 we report the sum of the "sampling" and "loading" outputs as the CPU Sampling Time for MariusGNN and report the "loading" output (which includes "sampling") for DGL and PyG.
- We report validation set accuracy in the paper as test sets are not expected to be publicly available for all datasets.
If you have hit an issue with the system, the scripts, or the results, please let us know (contact: waleffe@wisc.edu, or open an issue) and we will investigate and fix the issue if needed.