Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning

Citation

@misc{he2023harnessing,
      title={Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning}, 
      author={Xiaoxin He and Xavier Bresson and Thomas Laurent and Adam Perold and Yann LeCun and Bryan Hooi},
      year={2023},
      eprint={2305.19523},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

0. Python environment setup with Conda

conda create --name TAPE python=3.8
conda activate TAPE

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
conda install -c pyg pytorch-sparse
conda install -c pyg pytorch-scatter
conda install -c pyg pytorch-cluster
conda install -c pyg pyg
pip install ogb
conda install -c dglteam/label/cu113 dgl
pip install yacs
pip install transformers
pip install --upgrade accelerate
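
After installation, a quick check can confirm that the packages are importable from the TAPE environment. The following is a minimal sketch using only the standard library. The import names in `REQUIRED_MODULES` (for example `torch_geometric` for `pyg`) are assumptions about what the packages above install, and `find_missing` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names assumed for the packages installed above; adjust if your install differs.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio",
    "torch_sparse", "torch_scatter", "torch_cluster", "torch_geometric",
    "ogb", "dgl", "yacs", "transformers", "accelerate",
]


def find_missing(modules):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only checks that each module can be located; it does not verify versions or CUDA availability.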

1. Download TAG datasets

A. Original text attributes

Dataset	Description
ogbn-arxiv	OGB provides a mapping from MAG paper IDs to the raw text of titles and abstracts. Download the dataset here, unzip and move it to `dataset/ogbn_arxiv_orig`.
ogbn-products (subset)	The dataset is located under `dataset/ogbn_products_orig`.
arxiv_2023	Download the dataset here, unzip and move it to `dataset/arxiv_2023_orig`.
Cora	Download the dataset here, unzip and move it to `dataset/cora_orig`.
PubMed	Download the dataset here, unzip and move it to `dataset/PubMed_orig`.

B. LLM responses

Dataset	Description
ogbn-arxiv	Download the dataset here, unzip and move it to `gpt_responses/ogbn-arxiv`.
ogbn-products (subset)	Download the dataset here, unzip and move it to `gpt_responses/ogbn-products`.
arxiv_2023	Download the dataset here, unzip and move it to `gpt_responses/arxiv_2023`.
Cora	Download the dataset here, unzip and move it to `gpt_responses/cora`.
PubMed	Download the dataset here, unzip and move it to `gpt_responses/PubMed`.
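
Before training, it may help to confirm that every directory named in the two tables above is in place. The sketch below assumes it is run from the repository root; `missing_dirs` is a hypothetical helper, and the directory names are copied from the tables.

```python
from pathlib import Path

# Directory names taken from the tables in sections A and B above.
EXPECTED_DIRS = [
    "dataset/ogbn_arxiv_orig",
    "dataset/ogbn_products_orig",
    "dataset/arxiv_2023_orig",
    "dataset/cora_orig",
    "dataset/PubMed_orig",
    "gpt_responses/ogbn-arxiv",
    "gpt_responses/ogbn-products",
    "gpt_responses/arxiv_2023",
    "gpt_responses/cora",
    "gpt_responses/PubMed",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    for d in missing_dirs():
        print("Missing:", d)
```

Only the datasets you plan to use need to be present, so a non-empty result is not necessarily an error.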

2. Fine-tuning the LMs

To use the original text attributes

WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv

To use the GPT responses

WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv lm.train.use_gpt True
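
The two commands above differ only in the `lm.train.use_gpt True` override. For scripted runs, they could be wrapped roughly as follows. This is a sketch, not part of the repository: `build_lm_command`, `build_env`, and `run_lm` are hypothetical helpers, and the environment variables and arguments simply mirror the commands shown above.

```python
import os
import subprocess
import sys


def build_lm_command(dataset, use_gpt=False):
    """Build the argument list for `python -m core.trainLM`, as in the commands above."""
    cmd = [sys.executable, "-m", "core.trainLM", "dataset", dataset]
    if use_gpt:
        # Switch from the original text attributes to the GPT responses.
        cmd += ["lm.train.use_gpt", "True"]
    return cmd


def build_env(gpus="0,1,2,3"):
    """Copy the current environment and add the variables used in the commands above."""
    env = dict(os.environ)
    env.update({
        "WANDB_DISABLED": "True",
        "TOKENIZERS_PARALLELISM": "False",
        "CUDA_VISIBLE_DEVICES": gpus,
    })
    return env


def run_lm(dataset, use_gpt=False, gpus="0,1,2,3"):
    """Launch fine-tuning; must be called from the repository root."""
    subprocess.run(build_lm_command(dataset, use_gpt), env=build_env(gpus), check=True)
```

For example, `run_lm("ogbn-arxiv", use_gpt=True)` would correspond to the second command above.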

3. Training the GNNs

To use different GNN models

python -m core.trainEnsemble gnn.model.name MLP
python -m core.trainEnsemble gnn.model.name GCN
python -m core.trainEnsemble gnn.model.name SAGE
python -m core.trainEnsemble gnn.model.name RevGAT gnn.train.lr 0.002 gnn.train.dropout 0.75

To use different types of features

# Our enriched features
python -m core.trainEnsemble gnn.train.feature_type TA_P_E

# Our individual features
python -m core.trainGNN gnn.train.feature_type TA
python -m core.trainGNN gnn.train.feature_type E
python -m core.trainGNN gnn.train.feature_type P

# OGB features
python -m core.trainGNN gnn.train.feature_type ogb
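
To sweep over models and feature types, the commands above could be generated programmatically. The sketch below makes two assumptions that the examples above do not confirm: that `gnn.model.name` and `gnn.train.feature_type` can be passed together in one invocation, and that `core.trainEnsemble` is the right entry point only for `TA_P_E`. `build_gnn_command` is a hypothetical helper.

```python
def build_gnn_command(model="GCN", feature_type="TA_P_E", extra=None):
    """Build an argument list mirroring the GNN training commands above.

    Assumption: model and feature-type overrides can be combined in one call.
    """
    # The examples above use trainEnsemble for TA_P_E and trainGNN otherwise.
    module = "core.trainEnsemble" if feature_type == "TA_P_E" else "core.trainGNN"
    cmd = ["python", "-m", module,
           "gnn.model.name", model,
           "gnn.train.feature_type", feature_type]
    # Optional key/value overrides, e.g. the RevGAT settings shown earlier.
    for key, value in (extra or {}).items():
        cmd += [key, str(value)]
    return cmd


if __name__ == "__main__":
    revgat_overrides = {"gnn.train.lr": 0.002, "gnn.train.dropout": 0.75}
    for model in ["MLP", "GCN", "SAGE", "RevGAT"]:
        extra = revgat_overrides if model == "RevGAT" else None
        print(" ".join(build_gnn_command(model, "TA_P_E", extra)))
```

Each printed line can be run from the repository root, or passed to `subprocess.run` as a list.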

4. Reproducibility

Use run.sh to run the code and reproduce the published results.

This repository also provides the checkpoints for all trained models (*.ckpt) and the TAPE features (*.emb) used in the project. Please download them here.

arxiv-2023 dataset

The code for constructing and processing the arxiv-2023 dataset is provided here.