N.B.: There is a newer version of this project and paper, aimed at working on more datasets, including the OGB large-scale datasets. Check it out here: https://github.com/venomouscyanide/S3GRL_OGB.
S3GRL (Scalable Simplified Subgraph Representation Learning) is a subgraph representation learning (SGRL) framework aimed at faster link prediction. S3GRL introduces subgraph sampling and subgraph diffusion operator pairs that allow for fast precomputation leading to faster runtimes.
Please read our paper here: https://arxiv.org/abs/2301.12562
This work also constitutes as the primary contribution in my Master's thesis titled "Scalable Subgraph Representation Learning through Simplification", which was written as part of my Master's degree at UOIT. You can read it here.
Our S3GRL framework: In the preprocessing phase (shown by the shaded blue arrow), first multiple subgraphs are extracted around the target nodes
A higher quality architecture diagram can be found here: SG3RL_arch.pdf
All the requirements for getting the dev environment ready is available in quick_install.sh
.
All our codes can be run by setting a JSON configuration file. Example configuration files can be found in configs/paper/
. Once you have the configuration JSON file setup, run our codes using python sgrl_run_manager.py --config your_config_file_here.json --results_json your_result_file.json
. This command produces your_result_file.json
which contains the efficacy score using the configuration on the dataset the experiments are being run on.
Currently, our S3GRL framework support two instances, PoS and SoP. Both instances are available in tuned_SIGN.py
. You can add your own instances by modifying this file.
Specific arguments related to our framework:
sign_k
- The number of diffusion operators to create (corresponds to r in our paper)sign_type
- Set the type of S3GRL model. Either PoS or SoP.optimize_sign
- Choose to use optimized codes for running PoS or SoP.init_features
- Initialize features for non-attributed datasets. Our S3GRL models require initial node features to work. Choose between 'degree', 'eye' or 'n2v'.k_heuristic
- Choose to useCCN
(center-common-neighbor pooling) for the nodes other than source-target nodes.k_node_set_strategy
- How to choose the nodes forCCN
. Either intersection or union. Intersection means you choose the common neighbors. Works whenk_heuristic
is True.k_pool_strategy
- How to pool the nodes other than source-target in the forward pass forCCN
(the nodes you select usingk_node_set_strategy
). Either mean or sum pooling. Works whenk_heuristic
is True.init_representation
- Use an unsupervised model to train the initial features before running S3GRL. Choose between 'GIC', 'ARGVA', 'GAE', 'VGAE'.
We support any PyG dataset. However, the below list covers all the datasets we use in our paper's experiments:
- Planetoid Dataset (Cora, PubMed, CiteSeer) from "Revisiting Semi-Supervised Learning with Graph Embeddings https://arxiv.org/abs/1603.08861"
- SEAL datasets (USAir, Yeast etc. introduced in the original paper) from "Link prediction based on graph neural networks https://arxiv.org/pdf/1802.09691.pdf"
Notably, the OGB dataset support is actively being developed in a private fork.
We currently don't have an issue/PR template. However, if you find an issue in our code please create an issue in GitHub. It would be great if you could give as much information regarding the issue as possible (what command was run, what are the python package versions, providing full stack trace etc.).
If you have any further questions, you can reach out to us (the authors) via email and we will be happy to have a conversation. Paul Louis, Shweta Ann Jacob
- All baselines (except SGRL) can be reproduced using
baselines/run_helpers/run_*.py
, where * is the respective baseline script. - All SGRL baselines (except WalkPool) can be reproduced
using
python sgrl_run_manager.py --config configs/paper/table_2.json
. - WalkPool results can be reproduced using
bash run_ssgrl.sh
by running from Software/WalkPooling/bash - S3GRL results can be reproduced using
python sgrl_run_manager.py --config configs/paper/auc_s3grl.json
.
- WalkPool can be reproduced using
bash run_ssgrl_profile_attr.sh
andbash run_ssgrl_profile_non.sh
by running from Software/WalkPooling/bash - All other S3GRL and SGRL models can be reproduced
using
python sgrl_run_manager.py --config configs/paper/profiling_attr.json
andpython sgrl_run_manager.py --config configs/paper/profiling_non.json
- S3GRL models with ScaLed enabled can be reproduced
using
python sgrl_run_manager.py --config configs/paper/scaled.json
.
The code for S3GRL is based off a clone of SEAL-OGB by Zhang et al. (https://github.com/facebookresearch/SEAL_OGB) and ScaLed by Louis et al. (https://github.com/venomouscyanide/ScaLed). The baseline softwares used are adapted from GIC by Mavromatis et al. (https://github.com/cmavro/Graph-InfoClust-GIC) and WalkPool by Pan et al. (https://github.com/DaDaCheng/WalkPooling). There are also some baseline model codes taken from OGB implementations (https://github.com/snap-stanford/ogb) and other Pytorch Geometric implementations (https://github.com/pyg-team/pytorch_geometric).
Please cite our work if you find it useful in any way.
@misc{https://doi.org/10.48550/arxiv.2301.12562,
doi = {10.48550/ARXIV.2301.12562},
url = {https://arxiv.org/abs/2301.12562},
author = {Louis, Paul and Jacob, Shweta Ann and Salehi-Abari, Amirali},
keywords = {Machine Learning (cs.LG), Social and Information Networks (cs.SI), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Simplifying Subgraph Representation Learning for Scalable Link Prediction},
publisher = {arXiv},
year = {2023},
copyright = {arXiv.org perpetual, non-exclusive license}
}
The PDF of our preprint is available at https://arxiv.org/pdf/2301.12562.pdf