
Beyond Real-world Benchmark Datasets: An Empirical Study of Node Classification with GNNs

This paper has been accepted to the NeurIPS 2022 Datasets and Benchmarks Track 🎉🎉 [paper]

Our empirical studies clarify the strengths and weaknesses of GNNs along four major characteristics of real-world graphs with node class labels: 1) class size distributions (balanced vs. imbalanced), 2) edge connection proportions between classes (homophilic vs. heterophilic), 3) attribute values (biased vs. random), and 4) graph sizes (small vs. large).

Supported Models

MLP, GCN, ChebNet, MoNet, GAT, SGC, JK-GCN, JK-GAT, JK-GraphSAGE, GraphSAGE, GraphSAINT-GAT, GraphSAINT-GraphSAGE, Shadow-GAT, Shadow-GraphSAGE, H2GCN, FSGNN, GPRGNN, LINKX

Installation

All our experiments are executed with Python 3.7.13. Please run the scripts below to set up an environment for our codebase.

pip install torch torchvision torchaudio

Please see the official instructions to install PyTorch. We use torch==1.12.1.

pip install -r requirements.txt

Dataset Generation (GenCAT)

Choose a base dataset and generate synthetic datasets with GenCAT. If the base dataset is not in your directory, it will be downloaded automatically. Datasets generated with the pre-set parameters will be saved under the data directory.

python scripts/run_gencat.py --dataset cora

# To reproduce synthetic datasets for Section 6.1.1 (various class size distributions)
python scripts/run_gencat.py --dataset cora --exp classsize
 
# To reproduce synthetic datasets for Section 6.1.2 (various edge connection proportions between classes)
python scripts/run_gencat.py --dataset cora --exp hetero_homo

# To reproduce synthetic datasets for Section 6.1.3 (various attributes)
python scripts/run_gencat.py --dataset cora --exp attribute

# To reproduce synthetic datasets for Section 6.1.4 (various numbers of nodes and edges)
python scripts/run_gencat.py --dataset cora --exp scalability_node_edge

# To reproduce synthetic datasets for Section 6.2 (various numbers of edges)
python scripts/run_gencat.py --dataset cora --exp scalability_edge

[Optional] Links to Pre-generated Datasets

If you do not want to generate them yourself, you can download the synthetic datasets used in the paper from Google Drive or Our Lab Repository (both links provide the same datasets).

After downloading it, please place the unzipped folder at ./data/.

Reproduction of Experiments in the Paper

All plots in Figures 1-6 are shown in a notebook.

The raw experimental results are stored in CSV-formatted files.

Experiments in Supplementary Material

All plots are shown in a notebook.

You can find the raw experimental results in CSV-formatted files.

Hyperparameters

Search Space

The hyperparameter search space for each model is listed in JSON files.

The Best Sets of Hyperparameters for Each Experiment

We also provide the best hyperparameter sets used for the experiments.

Instruction for Running GNNs

If the base dataset is not in your directory, it will be downloaded automatically. Please move to the models folder, then run:

python train_model.py --train_rate 0.6 --val_rate 0.2 --RPMAX 2 --dataset cora --net GCN

Format of Datasets

Although all datasets are internally converted to the PyTorch format, you can convert datasets into other formats so that you can use them for your own use cases. The converted data will be stored in the dataset's directory. The formats this codebase supports are npz, semb, and planetoid.

An example is the following:

python scripts/converter.py --format npz --dataset GenCAT_texas
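If you work with the npz output, the sketch below shows a minimal way to inspect the converted file. The file path is hypothetical (the converted data is stored in the dataset's directory), and the key names inside the archive are defined by scripts/converter.py, so we only list them rather than assume them.

import numpy as np

# Hypothetical path; the converter stores its output in the dataset's directory.
archive = np.load("data/GenCAT_texas/GenCAT_texas.npz", allow_pickle=True)

# List which arrays were stored and their shapes; the exact key names
# are defined by scripts/converter.py.
for key in archive.files:
    print(key, getattr(archive[key], "shape", None))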

Customization

Below we describe how to customize this codebase for your own research / product.

How to Support Your Own GNN models?

Add your model class in ./models/GNN_models.py. You will also need to make minor updates to the net_dict variable, the imports from GNN_models, and the parser of the net argument in ./models/train_model.py, so that you can specify your model with a command-line argument. A sketch is shown below.
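As a rough illustration of what such a model class can look like, here is a minimal sketch. It assumes the codebase follows standard PyTorch Geometric conventions; the class name MyGNN, the two-layer structure, and the constructor signature are our own illustration and should be adapted to match the existing classes in GNN_models.py.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class MyGNN(torch.nn.Module):
    # Hypothetical two-layer GCN-style model, shown only to illustrate
    # the expected shape of a model class in ./models/GNN_models.py.
    def __init__(self, dataset, args):
        super().__init__()
        # args.hidden is assumed to be an existing hyperparameter argument.
        self.conv1 = GCNConv(dataset.num_features, args.hidden)
        self.conv2 = GCNConv(args.hidden, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

After adding the class, register it in ./models/train_model.py (e.g., net_dict['MyGNN'] = MyGNN) and add 'MyGNN' to the choices of the net argument.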

How to Prepare Your Own Dataset?

Add your dataset class in ./models/dataset_utils.py. You will also need to make a minor update to the DataLoader function in ./models/dataset_utils.py so that the class is used. The raw data can be in any format, but after preprocessing with the dataset class, it needs to be converted to the PyTorch format (a skeleton is sketched below). When you use GenCAT to create your datasets, you will also need to add your dataset to the datasets_to_convert list in ./scripts/run_gencat.py so that the format is converted to the planetoid format.
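A minimal skeleton of a dataset class, assuming the codebase builds on PyTorch Geometric's InMemoryDataset; the class name, file names, and placeholder tensors below are purely illustrative, and the existing classes in ./models/dataset_utils.py are the authoritative reference.

import torch
from torch_geometric.data import Data, InMemoryDataset

class MyDataset(InMemoryDataset):
    # Hypothetical skeleton for a custom dataset.
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ["my_graph.npz"]  # hypothetical raw file

    @property
    def processed_file_names(self):
        return ["data.pt"]

    def download(self):
        # We assume the raw data is placed manually under root/raw/.
        pass

    def process(self):
        # Convert your raw data into a PyTorch Geometric Data object with
        # node features x, an edge_index, and node labels y. The random
        # tensors below are placeholders for your real preprocessing.
        x = torch.randn(100, 16)
        edge_index = torch.randint(0, 100, (2, 400))
        y = torch.randint(0, 3, (100,))
        data = Data(x=x, edge_index=edge_index, y=y)
        torch.save(self.collate([data]), self.processed_paths[0])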

How to Tune Hyperparameters?

We use a third-party service, Comet ML, to tune hyperparameters with a grid search algorithm. If you choose the same approach, follow the instructions below.

  1. Add all parameters to the argument parser in ./models/train_model.py.
  2. Write all the parameters to be explored into a JSON file ({YOUR_MODEL}.json) and place it in ./configs/parameter_search/ (an illustrative example is shown after this list).
  3. Set your Comet ML credentials as environment variables.
  4. Run ./models/train_model.py with --parameter_search.
python train_model.py --train_rate 0.6 --val_rate 0.2 --RPMAX 10 --dataset cora --net GCN --parameter_search
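For step 2, the JSON file lists, for each hyperparameter, the values to be explored by the grid search. The snippet below is only a guess at the shape of such a file; the parameter names are illustrative, and the existing files in ./configs/parameter_search/ are the authoritative reference.

{
  "lr": [0.005, 0.01, 0.05],
  "weight_decay": [0.0, 0.0005],
  "hidden": [16, 64],
  "dropout": [0.3, 0.5]
}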

A simple way to use the best parameter set you found is to add it to ./configs/best_params/best_params_supervised.csv as a new row (an illustrative row is shown below). If the file contains a row with a matching dataset/net combination, ./models/train_model.py runs with those best parameters, so you do not have to pass them as arguments. Alternatively, you can set the best parameters as arguments when running the code.
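For illustration only, a new row might look like the following; the actual column order is defined by the header of best_params_supervised.csv, and the column names here are hypothetical:

dataset,net,lr,weight_decay,hidden,dropout
GenCAT_cora,GCN,0.01,0.0005,64,0.5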

Built-in Datasets

This framework allows users to use the following real-world datasets:

Dataset       # Nodes   # Edges
Cora            2,708     5,278
Pubmed         19,717    44,324
Citeseer        3,327     4,552
Texas             183       295
Wisconsin         251       466
Cornell           183       280
Actor           7,600    26,752
Chameleon       2,277    31,421
Squirrel        5,200   198,493
BlogCatalog     5,196   343,486
Flickr          7,575   479,476

Users can choose a dataset by passing --dataset [dataset name].
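For example, to train GAT on Chameleon with the same split settings as in the command above (we assume dataset names are passed in lowercase, as with cora):

python train_model.py --train_rate 0.6 --val_rate 0.2 --RPMAX 2 --dataset chameleon --net GAT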