Lianghao Xia, Ben Kao, and Chao Huang* (*Correspondence)
Presenting OpenGraph, a graph foundation model that distills zero-shot graph generalizability from LLMs.
To achieve this goal, OpenGraph addresses several key technical challenges:
- We propose a unified graph tokenizer that enables our model to generalize to unseen graph data, even when the underlying graph properties differ significantly from those encountered during training.
- We develop a scalable graph transformer as the foundational encoder, which effectively and efficiently captures node-wise dependencies within the global topological context.
- We introduce a data augmentation mechanism enhanced by a large language model (LLM) to alleviate the limitations of data scarcity in real-world scenarios.
Extensive experiments validate the effectiveness of our framework. By adapting OpenGraph to new graph characteristics and comprehending the nuances of diverse graphs, our approach achieves remarkable zero-shot graph learning performance across various settings and domains.
You need to unzip some of the data files in datasets/, and download the pre-trained models using the link in Models/readme. Our experiments were conducted with the following package versions:
- python==3.10.13
- torch==1.13.0
- numpy==1.23.4
- scipy==1.9.3
Here is a brief overview of the code structure. The explanation for each directory is enclosed in ## ... ## markers. For more details, please refer to the full directory tree at the end of this readme.
./
│ ├── History/ ## Training history of pre-trained models ##
│ ├── Models/ ## Pre-trained models ##
│ ├── datasets/
│ ├── graph_generation/ ## Code and examples for graph generation ##
│ ├── imgs/ ## Images used in readme ##
│ ├── link_prediction/ ## Code for link prediction and pre-training ##
│ │ ├── data_handler.py
│ │ ├── main.py
│ │ ├── model.py
│ │ ├── params.py
│ │ └── Utils/
│ │ │ └── TimeLogger.py
│ ├── node_classification/ ## Code for testing on node classification ##
│ │ ├── data_handler.py
│ │ ├── main.py
│ │ ├── model.py
│ │ ├── params.py
│ │ └── Utils/
│ │ │ └── TimeLogger.py
│ └── README.md
To reproduce the reported zero-shot performance on link prediction and node classification, run the following commands:
cd link_prediction/
python main.py --load pretrn_gen1 --epoch 0 # test on OGBL-Collab, ML-1M, ML-10M
python main.py --load pretrn_gen0 --tstdata amazon-book --epoch 0 # test on Amazon-Book
python main.py --load pretrn_gen2 --tstdata ddi --epoch 0 # test on OGBL-ddi
cd ../node_classification/
python main.py --load pretrn_gen1 --tstdata cora # test on Cora
python main.py --load pretrn_gen1 --tstdata citeseer # test on Citeseer
python main.py --load pretrn_gen1 --tstdata pubmed # test on Pubmed
To pre-train OpenGraph yourself, run the following commands:
cd ../link_prediction/
python main.py --save pretrn_gen1
python main.py --trndata gen0 --tstdata amazon-book --save pretrn_gen0
python main.py --trndata gen2 --tstdata ddi --save pretrn_gen2
To explore pre-training with different combinations of pre-training and testing datasets, modify trn_datasets and tst_datasets at line 241 of link_prediction/main.py.
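For illustration, the modified lines might look as follows; the exact variable contents at line 241 may differ, and the dataset names here are merely examples drawn from datasets/:

```python
# link_prediction/main.py, around line 241 (illustrative values only)
trn_datasets = ['gen1']            # dataset(s) used for pre-training
tst_datasets = ['ml1m', 'ml10m']   # dataset(s) used for zero-shot testing
```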
The graph generation code is in graph_generation/, and a small toy dataset is provided. You first need to fill in your OpenAI key in Utils.py and itemCollecting_dfsIterator.py. To generate your own dataset, modify the descs and hyperparams dicts (an illustrative sketch follows the commands below), and run the following steps:
cd graph_generation/
python itemCollecting_dfsIterator.py
python instance_number_estimation_hierarchical.py
python embedding_generation.py
python human_item_generation_gibbsSampling_embedEstimation.py
python make_adjs.py
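As a rough sketch of the kind of edits involved, the two dicts might look as follows. The keys and values here are illustrative assumptions (the scenario and node-type strings are taken from the toy example's file names); check the actual definitions in the scripts:

```python
# Hypothetical illustration of the descs and hyperparams dicts
descs = {
    'scenario': 'e-commerce platform like Amazon',  # natural-language scenario given to the LLM
    'item': 'products',                             # root node type to expand into a category tree
}
hyperparams = {
    'max_depth': 2,       # depth of the LLM-driven category-tree expansion
    'temperature': 1.0,   # LLM sampling temperature
}
```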
Below we show our prompt template, along with examples of prompt configurations and generated nodes.
OpenGraph achieves the best performance under the 0-shot setting, compared to baselines trained or tuned with 1-shot and 5-shot data.
We studied the influence of using different pre-training datasets. Results below indicate that:
- The generation techniques (Norm, Loc, Topo) have positive effects on performance.
- Real-world datasets (Yelp2018, Gowalla) may yield worse results compared to our generated ones.
- A pre-training dataset relevant to the test data (ML-10M for the ML-1M and ML-10M test sets) yields superior performance.
We tuned the configuration of our unified graph tokenizer by adjusting the order of adjacency smoothing and by replacing our topology-aware projection with alternatives. Our findings include:
- Adjacency smoothing is important, as OpenGraph with 0-order smoothing yields inferior performance.
- Our topology-aware projection performs best. The alternatives are: One-hot, which learns a single large representation table shared across all datasets; Random, which makes no assumption about node-wise relations and assigns representations uniformly at random; and Degree, a widely used method for non-attributed graphs that seems applicable to the cross-graph scenario. A sketch of the tokenizer follows this list.
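For concreteness, here is a minimal sketch of how smoothing order and projection interact in a tokenizer of this kind, using the Random alternative above as a stand-in for our topology-aware projection (an illustration under these assumptions, not the actual OpenGraph implementation):

```python
import numpy as np
import scipy.sparse as sp

def tokenize(adj: sp.csr_matrix, dim: int, order: int = 2) -> np.ndarray:
    """Map a graph of arbitrary size to fixed-dimension node tokens."""
    # Symmetric adjacency normalization: D^{-1/2} A D^{-1/2}
    deg = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = sp.diags(np.power(np.maximum(deg, 1e-12), -0.5))
    norm_adj = d_inv_sqrt @ adj @ d_inv_sqrt
    # Random projection stand-in; OpenGraph's topology-aware projection performs better
    tokens = np.random.randn(adj.shape[0], dim) / np.sqrt(dim)
    # Apply `order` rounds of adjacency smoothing; order=0 (no smoothing) hurts performance
    for _ in range(order):
        tokens = norm_adj @ tokens
    return tokens
```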
We ablated the two sampling techniques in the graph transformer; the results below show their positive effects on both memory and time costs. Surprisingly, token sequence sampling also has a positive effect on model performance. A minimal sketch of this sampling step is shown below.
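The sketch assumes node tokens are stacked into a (num_tokens, dim) tensor; the actual sampler in model.py may differ:

```python
import torch

def sample_token_sequence(tokens: torch.Tensor, keep_ratio: float = 0.5):
    # Self-attention cost grows quadratically with sequence length, so
    # training on a random subset of tokens reduces both memory and time.
    num_keep = max(1, int(tokens.size(0) * keep_ratio))
    idx = torch.randperm(tokens.size(0), device=tokens.device)[:num_keep]
    return tokens[idx], idx  # sampled tokens and their original positions
```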
If you find this work useful for your research, please consider citing our paper:
@inproceedings{xia2024opengraph,
title={OpenGraph: Towards Open Graph Foundation Models},
author={Xia, Lianghao and Kao, Ben and Huang, Chao},
booktitle={EMNLP},
year={2024}
}
./
│ ├── History/ ## Training history of pre-trained models ##
│ │ ├── pretrn_gen0.his
│ │ ├── pretrn_gen1.his
│ │ └── pretrn_gen2.his
│ ├── Models/ ## Pre-trained models ##
│ │ └── readme ## Download pre-trained models using the link inside ##
│ ├── datasets/
│ │ ├── amazon-book/
│ │ │ ├── fewshot_mat_1.pkl
│ │ │ ├── trn_mat.pkl.zip ## Unzip it manually ##
│ │ │ ├── tst_mat.pkl
│ │ │ └── fewshot_mat_5.pkl
│ │ ├── citeseer/
│ │ │ ├── adj_-1.pkl
│ │ │ ├── adj_1.pkl
│ │ │ ├── adj_5.pkl
│ │ │ ├── feats.pkl.zip ## Unzip it manually ##
│ │ │ ├── label.pkl
│ │ │ ├── mask_-1.pkl
│ │ │ ├── mask_1.pkl
│ │ │ └── mask_5.pkl
│ │ ├── collab/
│ │ │ ├── fewshot_mat_5.pkl
│ │ │ ├── trn_mat.pkl.zip ## Unzip it manually ##
│ │ │ ├── tst_mat.pkl
│ │ │ ├── val_mat.pkl
│ │ │ └── fewshot_mat_1.pkl
│ │ ├── cora/
│ │ │ ├── adj_-1.pkl
│ │ │ ├── adj_1.pkl
│ │ │ ├── adj_5.pkl
│ │ │ ├── feats.pkl
│ │ │ ├── label.pkl
│ │ │ ├── mask_-1.pkl
│ │ │ ├── mask_1.pkl
│ │ │ └── mask_5.pkl
│ │ ├── ddi/
│ │ │ ├── fewshot_mat_1.pkl
│ │ │ ├── trn_mat.pkl.zip ## Unzip it manually ##
│ │ │ ├── tst_mat.pkl
│ │ │ ├── val_mat.pkl
│ │ │ └── fewshot_mat_5.pkl
│ │ ├── gen0/
│ │ │ ├── trn_mat.pkl
│ │ │ ├── val_mat.pkl
│ │ │ └── tst_mat.pkl
│ │ ├── gen1/
│ │ │ ├── trn_mat.pkl
│ │ │ ├── tst_mat.pkl
│ │ │ └── val_mat.pkl
│ │ ├── gen2/
│ │ │ ├── trn_mat.pkl
│ │ │ ├── val_mat.pkl
│ │ │ └── tst_mat.pkl
│ │ ├── ml10m/
│ │ │ ├── fewshot_mat_1.pkl
│ │ │ ├── trn_mat.pkl.zip ## Unzip it manually ##
│ │ │ ├── tst_mat.pkl.zip ## Unzip it manually ##
│ │ │ └── fewshot_mat_5.pkl
│ │ ├── ml1m/
│ │ │ ├── fewshot_mat_5.pkl
│ │ │ ├── trn_mat.pkl
│ │ │ ├── tst_mat.pkl
│ │ │ └── fewshot_mat_1.pkl
│ │ └── pubmed/
│ │ │ ├── adj_-1.pkl
│ │ │ ├── adj_1.pkl
│ │ │ ├── feats.pkl.zip ## Unzip it manually ##
│ │ │ ├── label.pkl
│ │ │ ├── mask_-1.pkl
│ │ │ ├── mask_1.pkl
│ │ │ ├── mask_5.pkl
│ │ │ └── adj_5.pkl
│ ├── graph_generation/ ## Code and examples for graph generation ##
│ │ ├── embedding_generation.py ## Node embedding generation ##
│ │ ├── human_item_generation_gibbsSampling_embedEstimation.py ## Edge generation ##
│ │ ├── instance_number_estimation_hierarchical.py ## Estimate the number of instances for each node. Not mentioned in the paper. ##
│ │ ├── itemCollecting_dfsIterator.py ## Node generation ##
│ │ ├── make_adjs.py ## Making datasets for generated graphs ##
│ │ ├── Utils.py
│ │ ├── Exp_Utils/
│ │ │ ├── Emailer.py ## A tool to send warning emails for experiments ##
│ │ │ └── TimeLogger.py
│ │ └── gen_results/
│ │ │ ├── tree_wInstanceNum_products_e-commerce platform like Amazon.pkl ## Tree data structure ##
│ │ │ ├── products_e-commerce platform like Amazon.txt ## Node list ##
│ │ │ ├── datasets/
│ │ │ │ └── gen_data_ecommerce/
│ │ │ │ │ ├── embedding_dict.pkl
│ │ │ │ │ ├── item_list.pkl
│ │ │ │ │ ├── interaction_base-0_iter-0.pkl ## Generated edges ##
│ │ │ │ │ └── res/
│ │ │ │ │ │ ├── iter-0_imap.pkl ## Id map for nodes ##
│ │ │ │ │ │ ├── iter-0_test.pkl
│ │ │ │ │ │ ├── iter-0_train.pkl
│ │ │ │ │ │ ├── iter-0_valid.pkl
│ │ │ │ │ │ └── interaction_fuse_iter-0.pkl
│ │ │ └── tem/ ## Temporary files for node generation ##
│ │ │ │ ├── e-commerce platform like Amazon_depth1_products
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Automotive
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Baby
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Beauty
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Books
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Clothing
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Electronics
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Handmade
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Health and Personal Care
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Home Improvement
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Industrial and Scientific
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Jewelry
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Musical Instruments
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Office Products
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Pet Supplies
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Tools and Home Improvement
│ │ │ │ ├── e-commerce platform like Amazon_depth2_products, Toys
│ │ │ │ └── e-commerce platform like Amazon_depth2_products, Sports and Outdoors
│ ├── imgs/ ## Images used in readme ##
│ │ ├── framework.png
│ │ ├── intro.png
│ │ ├── performance.png
│ │ └── article cover.jpg
│ ├── link_prediction/ ## Code for link prediction and pre-training ##
│ │ ├── data_handler.py
│ │ ├── main.py
│ │ ├── model.py
│ │ ├── params.py
│ │ └── Utils/
│ │ │ └── TimeLogger.py
│ ├── node_classification/ ## Code for testing on node classification ##
│ │ ├── data_handler.py
│ │ ├── main.py
│ │ ├── model.py
│ │ ├── params.py
│ │ └── Utils/
│ │ │ └── TimeLogger.py
│ └── README.md