HeaRT

Official code for the NeurIPS'23 paper "Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking" and the ICLR'24 paper "Revisiting Link Prediction: A Data Perspective".

Installation

Please see installation.md for how to install the required dependencies.

Download Data

All data can be downloaded by running the download_data.sh script:

cd HeaRT  # Must be in the root directory
bash download_data.sh

This includes the negative samples generated by HeaRT and the splits for Cora, Citeseer, and Pubmed. The data for the OGB datasets is downloaded automatically by the ogb package.

Reproduce Results

The commands needed to reproduce all the results with the appropriate hyperparameters can be found in the scripts/hyperparameter directory. Each dataset has its own file, which contains the command to train and evaluate each method.

For example, to reproduce the results on ogbl-collab under the existing evaluation setting, the command for each method can be found in the ogbl-collab.sh file located in the scripts/hyperparameter/existing_setting_ogb/ directory.

To run the code, first change to the directory for the appropriate setting. The options are:

  • benchmarking/exist_setting_small: Run models on Cora, Citeseer, and Pubmed under the existing setting.
  • benchmarking/exist_setting_ogb: Run models on ogbl-collab, ogbl-ppa, and ogbl-citation2 under the existing setting.
  • benchmarking/exist_setting_ddi: Run models on ogbl-ddi under the existing setting.
  • benchmarking/HeaRT_small: Run models on Cora, Citeseer, and Pubmed under HeaRT.
  • benchmarking/HeaRT_ogb: Run models on ogbl-collab, ogbl-ppa, and ogbl-citation2 under HeaRT.
  • benchmarking/HeaRT_ddi: Run models on ogbl-ddi under HeaRT.
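
The mapping from dataset and setting to directory can be captured in a small lookup. The sketch below is illustrative only: the function name `setting_dir` and the setting labels `"existing"` and `"heart"` are assumptions and are not part of the repository. Only the directory names come from the list above.

```python
from pathlib import PurePosixPath

# Dataset groups follow the list above; the helper itself is hypothetical.
SMALL = {"cora", "citeseer", "pubmed"}
OGB = {"ogbl-collab", "ogbl-ppa", "ogbl-citation2"}
DDI = {"ogbl-ddi"}


def setting_dir(data_name: str, setting: str) -> PurePosixPath:
    """Return the benchmarking directory for a dataset under a given setting."""
    prefixes = {"existing": "exist_setting", "heart": "HeaRT"}
    if setting not in prefixes:
        raise ValueError(f"unknown setting: {setting!r}")
    name = data_name.lower()
    if name in SMALL:
        group = "small"
    elif name in OGB:
        group = "ogb"
    elif name in DDI:
        group = "ddi"
    else:
        raise ValueError(f"unknown dataset: {data_name!r}")
    return PurePosixPath("benchmarking") / f"{prefixes[setting]}_{group}"


print(setting_dir("cora", "existing"))
print(setting_dir("ogbl-ddi", "heart"))
```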

Below we give examples of running GCN on the different groups of datasets under both settings:

Cora/Citeseer/Pubmed under the existing setting:

cd benchmarking/exist_setting_small/
python main_gnn_CoraCiteseerPubmed.py --data_name cora --gnn_model GCN --lr 0.01 --dropout 0.3 --l2 1e-4 --num_layers 1 --num_layers_predictor 3 --hidden_channels 128 --epochs 9999 --kill_cnt 10 --eval_steps 5 --batch_size 1024

ogbl-collab under the existing setting (similar for ogbl-ppa and ogbl-citation2):

cd benchmarking/exist_setting_ogb/
python main_gnn_ogb.py --use_valedges_as_input --data_name ogbl-collab --gnn_model GCN --hidden_channels 256 --lr 0.001 --dropout 0. --num_layers 3 --num_layers_predictor 3 --epochs 9999 --kill_cnt 100 --batch_size 65536

ogbl-ddi under the existing setting:

cd benchmarking/exist_setting_ddi/
python main_gnn_ddi.py --data_name ogbl-ddi --gnn_model GCN --lr 0.01 --dropout 0.5 --num_layers 3 --num_layers_predictor 3 --hidden_channels 256 --epochs 9999 --eval_steps 1 --kill_cnt 100 --batch_size 65536

Cora/Citeseer/Pubmed under HeaRT:

cd benchmarking/HeaRT_small/
python main_gnn_CoraCiteseerPubmed.py --data_name cora --gnn_model GCN --lr 0.001 --dropout 0.5 --l2 0 --num_layers 1 --hidden_channels 256 --num_layers_predictor 3 --epochs 9999 --kill_cnt 10 --eval_steps 5 --batch_size 1024

ogbl-collab under HeaRT (similar for ogbl-ppa and ogbl-citation2):

cd benchmarking/HeaRT_ogb/
python main_gnn_ogb.py --data_name ogbl-collab --use_valedges_as_input --gnn_model GCN --lr 0.001 --dropout 0.3 --num_layers 3 --hidden_channels 256 --num_layers_predictor 3 --epochs 9999 --kill_cnt 100 --eval_steps 1 --batch_size 65536

ogbl-ddi under HeaRT:

cd benchmarking/HeaRT_ddi/
python main_gnn_ddi.py --data_name ogbl-ddi --gnn_model GCN --lr 0.01 --dropout 0 --num_layers 3 --hidden_channels 256 --num_layers_predictor 3 --epochs 9999 --kill_cnt 100 --eval_steps 1 --batch_size 65536
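
When sweeping over models or hyperparameters, it can help to assemble commands like the ones above programmatically. The following is a minimal sketch, assuming a hypothetical helper `build_command` (not part of the repository) that turns a dictionary of flags into an argument list. The script name and flag values are copied from the GCN-on-Cora example under the existing setting.

```python
import shlex


def build_command(script, flags):
    """Build an argv list. True becomes a bare switch; False/None is skipped."""
    argv = ["python", script]
    for name, value in flags.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


# Flags copied from the Cora example under the existing setting.
cmd = build_command(
    "main_gnn_CoraCiteseerPubmed.py",
    {
        "data_name": "cora", "gnn_model": "GCN", "lr": 0.01, "dropout": 0.3,
        "l2": "1e-4", "num_layers": 1, "num_layers_predictor": 3,
        "hidden_channels": 128, "epochs": 9999, "kill_cnt": 10,
        "eval_steps": 5, "batch_size": 1024,
    },
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="benchmarking/exist_setting_small", check=True)` from the repository root, so that the script runs inside its setting directory.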

Generate Negative Samples using HeaRT

The negative samples generated by HeaRT that were used in the study can be reproduced via the scripts in the scripts/HeaRT/ directory.

A custom set of negative samples can be produced by running the heart_negatives/create_heart_negatives.py script. Several options are available for customizing the negative samples:

  • The CN metric used. Can be either CN (Common Neighbors) or RA (Resource Allocation); the default is RA. Specified via the --cn-metric argument.
  • The aggregation function used. Can be either min or mean (default is min). Specified via the --agg argument.
  • The number of negatives generated per positive sample. Specified via the --num-samples argument (default is 500).
  • The PPR parameters: the tolerance used for approximating the PPR (--eps argument) and the teleportation probability (--alpha argument). alpha is fixed at 0.15 for all datasets. For the tolerance, eps, we recommend following the settings found in scripts/HeaRT.
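
For intuition about these options, the sketch below computes CN, RA, and Personalized PageRank (PPR) scores on a toy graph and combines per-heuristic ranks with min or mean. It uses the standard textbook definitions of CN and RA. The PPR here is a plain power iteration with a stopping tolerance, which is a stand-in and not necessarily the approximation used by create_heart_negatives.py. The graph, the function names, and the rank-combination convention are all illustrative assumptions.

```python
from statistics import mean

# Toy undirected graph as adjacency sets (hypothetical data).
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2, 4},
    4: {3},
}


def cn_score(u, v):
    """Common Neighbors: number of shared neighbors."""
    return len(adj[u] & adj[v])


def ra_score(u, v):
    """Resource Allocation: shared neighbors weighted by 1 / degree."""
    return sum(1.0 / len(adj[w]) for w in adj[u] & adj[v])


def ppr(source, alpha=0.15, eps=1e-6):
    """PPR by power iteration; alpha is the teleportation probability."""
    p = {n: 0.0 for n in adj}
    p[source] = 1.0
    while True:
        nxt = {n: (alpha if n == source else 0.0) for n in adj}
        for u, mass in p.items():
            share = (1 - alpha) * mass / len(adj[u])
            for w in adj[u]:
                nxt[w] += share
        # Stop once no entry moved by more than the tolerance.
        if max(abs(nxt[n] - p[n]) for n in adj) < eps:
            return nxt
        p = nxt


def aggregate_ranks(ranks, agg="min"):
    """Combine the ranks one candidate gets from several heuristics (assumed convention)."""
    if agg == "min":
        return min(ranks)
    if agg == "mean":
        return mean(ranks)
    raise ValueError(f"unknown aggregation: {agg!r}")


print(cn_score(1, 3))  # shared neighbors of 1 and 3 are {0, 2}
print(ra_score(1, 3))  # 1/deg(0) + 1/deg(2)
print(ppr(0)[4])       # PPR mass that node 0 assigns to node 4
print(aggregate_ranks([2, 7, 4], "min"), aggregate_ranks([2, 7, 4], "mean"))
```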

Updates

November 3rd, 2023

  • Modified the negative samples for ogbl-collab to allow train/valid positive samples to be negatives. Please see Appendix I in the paper for our rationale.

February 17th, 2024

Cite

@inproceedings{li2023evaluating,
  title={Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking},
  author={Li, Juanhui and Shomer, Harry and Mao, Haitao and Zeng, Shenglai and Ma, Yao and Shah, Neil and Tang, Jiliang and Yin, Dawei},
  booktitle={Neural Information Processing Systems {NeurIPS}, Datasets and Benchmarks Track},
  year={2023}
}
@inproceedings{mao2023revisiting,
  title={Revisiting Link Prediction: A Data Perspective},
  author={Mao, Haitao and Li, Juanhui and Shomer, Harry and Li, Bingheng and Fan, Wenqi and Ma, Yao and Zhao, Tong and Shah, Neil and Tang, Jiliang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}