This is the official code base of the paper
Inductive Logical Query Answering in Knowledge Graphs
Mikhail Galkin, Zhaocheng Zhu, Hongyu Ren, Jian Tang
Important: the camera-ready NeurIPS'22 version was identified to have datasets with possible test set leakage. The new version, including this repository and the updated arXiv submission, has new datasets and experiments where this issue is fixed. We recommend using the latest versions of the datasets (2.0 on Zenodo) and experiments (v2 on arXiv) for further comparisons.
Inductive query answering is the setup where at inference time an underlying graph can have new, unseen entities. In this paper, we study a practical inductive setup when a training graph is extended with more nodes and edges at inference time. That is, an inference graph is always a superset of the training graph. Note that the inference graph always shares the same set of relation types with the training graph.
The two big implications of the inductive setup:
- test queries involve new, unseen nodes, and answers can be both seen and unseen nodes;
- training queries might now have more answers among the new nodes.
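A toy illustration of these two implications (made-up graph in plain Python, not the repo's code): the inference graph is a superset of the training graph that shares relation types but adds new entities, so the same query can gain new answers.

```python
train_edges = {("a", "likes", "b")}
# Inference graph = training graph + new nodes "c", "d" and new edges.
inference_edges = train_edges | {("a", "likes", "c"), ("d", "likes", "b")}

def answer_1p(edges, head, relation):
    """Answers to the 1p query (head, relation, ?) by plain edge traversal."""
    return {t for h, r, t in edges if h == head and r == relation}

assert answer_1p(train_edges, "a", "likes") == {"b"}
# The training query picks up the unseen node "c" at inference time:
assert answer_1p(inference_edges, "a", "likes") == {"b", "c"}
```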
The two inductive approaches implemented in this repo:
- NodePiece-QE (Inductive node representations): based on NodePiece and CQD. Train on 1p link prediction, inference-only zero-shot logical query answering over unseen entities. The NodePiece encoder can be extended with the additional GNN encoder (CompGCN) that is denoted as NodePiece-QE w/ GNN in the paper.
- Inductive GNN-QE (Inductive relational structure representations): based on GNN-QE. Trainable on complex queries, achieves higher performance than NodePiece-QE but is more expensive to train.
We additionally provide a dummy Edge-type Heuristic (`model.HeuristicBaseline`) that only considers possible tails of the last relation projection step of any query pattern.
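Conceptually, the heuristic can be sketched as follows (a toy re-implementation of the idea, not `model.HeuristicBaseline` itself): every entity that ever appears as a tail of the query's final relation is a candidate answer.

```python
from collections import defaultdict

# Toy graph: the heuristic only looks at (relation -> observed tails).
edges = [("a", "r1", "b"), ("c", "r1", "d"), ("e", "r2", "b")]

tails_by_relation = defaultdict(set)
for h, r, t in edges:
    tails_by_relation[r].add(t)

def heuristic_candidates(last_relation):
    """Candidate answers for ANY query whose final projection uses last_relation."""
    return tails_by_relation[last_relation]

assert heuristic_candidates("r1") == {"b", "d"}
assert heuristic_candidates("r2") == {"b"}
```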
We created 10 new inductive query answering datasets where validation/test graphs extend the training graph and contain new entities:
- Small-scale: 9 datasets based on FB15k-237 with the ratio of inference-to-training nodes varying from 106% to 550%, about 15k nodes in total across the splits.
- Large-scale: 1 dataset based on OGB WikiKG2 with a fixed ratio of 133%: 1.5M training nodes, plus 500K new nodes and 5M new edges at inference.
Datasets Description
Each dataset is a zip archive containing 17 files:

- `train_graph.txt` (`pt` for wikikg) - original training graph
- `val_inference.txt` (`pt`) - inference graph (validation split); new nodes in validation are disjoint with the test inference graph
- `val_predict.txt` (`pt`) - missing edges in the validation inference graph to be predicted
- `test_inference.txt` (`pt`) - inference graph (test split); new nodes in test are disjoint with the validation inference graph
- `test_predict.txt` (`pt`) - missing edges in the test inference graph to be predicted
- `train/valid/test_queries.pkl` - queries of the respective split; 14 query types for FB15k-237-derived datasets, 9 types for WikiKG (EPFO-only)
- `*_answers_easy.pkl` - easy answers to the respective queries that do not require predicting missing links, only edge traversal
- `*_answers_hard.pkl` - hard answers to the respective queries that DO require predicting missing links and against which the final metrics are computed
- `train_answers_valid.pkl` - the extended set of answers for training queries on the bigger validation graph; most training queries have at least one new answer. This is an inference-only dataset to measure the faithfulness of trained models
- `train_answers_test.pkl` - the extended set of answers for training queries on the bigger test graph; most training queries have at least one new answer. This is an inference-only dataset to measure the faithfulness of trained models
- `og_mappings.pkl` - entity2id / relation2id dictionaries mapping local node/relation IDs of a respective dataset to the original FB15k-237 / WikiKG2 IDs
- `stats.txt` - a small file with dataset stats
All datasets are available on Zenodo, please refer to v2.0 of the datasets. The datasets will be downloaded automatically upon the first run.
Additionally, we provide lightweight dumps (Zenodo) of just those graphs (without queries and answers) for training simple link prediction and KG completion models. Please refer to v2.0 of the datasets.
The dependencies can be installed via either conda or pip. NodePiece-QE and GNN-QE are compatible with Python 3.7/3.8/3.9 and PyTorch >= 1.8.0.

With conda:

```bash
conda install torchdrug pytorch cudatoolkit -c milagraph -c pytorch -c pyg
conda install pytorch-sparse pytorch-scatter -c pyg
conda install easydict pyyaml -c conda-forge
```

With pip:

```bash
pip install torchdrug torch
pip install easydict pyyaml
pip install wandb tensorboardx
```
Then install `torch-scatter` and `torch-sparse` following the instructions in the GitHub repo. For example, for PyTorch 1.10 and CUDA 10.2:

```bash
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu102.html
```
Conceptually, running NodePiece-QE consists of two parts:
- Training a neural link predictor using NodePiece (+ optional GNN), saving materialized embeddings of the test graph.
- Running CQD over the saved embeddings.
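CQD answers complex queries by combining pre-trained link-predictor scores for individual atoms with t-norms. A rough sketch of the idea for a 2p query (toy scores and product t-norm over enumerated intermediates; the actual `cqd/main.py` implementation differs in scoring, beam search, and batching):

```python
# Hypothetical per-atom scores in [0, 1]: plausibility of each link
# as produced by a pre-trained link predictor (made-up numbers).
score = {
    ("a", "r1", "x"): 0.9, ("a", "r1", "y"): 0.4,
    ("x", "r2", "b"): 0.8, ("y", "r2", "b"): 0.95,
}
entities = ["a", "b", "x", "y"]

def score_2p(head, r1, r2, tail):
    """2p query head --r1--> ? --r2--> tail: take the best intermediate
    entity under the product t-norm of the two atom scores."""
    return max(
        score.get((head, r1, v), 0.0) * score.get((v, r2, tail), 0.0)
        for v in entities
    )

# The best path goes through "x": 0.9 * 0.8 = 0.72.
assert abs(score_2p("a", "r1", "r2", "b") - 0.72) < 1e-9
```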
Step 1: Training a Link Predictor
Use the `NodePiece` model with the task `InductiveKnowledgeGraphCompletion` applied to the dataset of choice.

We prepared 5 configs for FB15k-237-derived datasets in the `config/lp_pretraining` directory: 2 for NodePiece w/o GNN and 3 for NodePiece w/ GNN, following the hyperparameters reported in the paper. The `_550` configs have a higher `input_dim`, so we keep a dedicated file for them to pass fewer params to the training script.
We also provide 2 configs for the WikiKG graph and recommend running pre-training in multi-GPU mode due to the size of the graph.
Example of training a vanilla NodePiece on the 175% dataset:

```bash
python script/run.py -c config/lp_pretraining/nodepiece_nognn.yaml --ratio 175 --temp 0.5 --epochs 2000 --gpus [0] --logger console
```
NodePiece + GNN on the 175% dataset:

```bash
python script/run.py -c config/lp_pretraining/nodepiece_gnn.yaml --ratio 175 --temp 1.0 --epochs 1000 --gpus [0] --logger console
```
For datasets with ratios 106-150, use the 5-layer GNN config `config/lp_pretraining/nodepiece_gnn.yaml`.
- Use `--gpus null` to run the scripts on a CPU.
- Use `--logger wandb` to send training logs to wandb; don't forget to prepend the env variable `WANDB_ENTITY=(your_entity)` before executing the python script.
After training, materialized entity and relation embeddings of the test graph will be stored in the `output_dir` folder.
WikiKG training requires a vocabulary of mined NodePiece anchors; we ship a precomputed vocab `20000_anchors_d0.4_p0.4_r0.2_25sp_bfs.pkl` together with the `wikikg.zip` archive.
You can mine your own vocab by playing around with the `NodePieceTokenizer` -- mining is implemented on a GPU and should be much faster than in the original NodePiece implementation.
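To give an intuition for what anchor mining produces, here is a simplified sketch of NodePiece-style tokenization (assumptions only, not the repo's `NodePieceTokenizer`): each node is represented by its nearest anchor nodes, found by BFS over the graph.

```python
from collections import deque

# Toy undirected graph and a hand-picked anchor set.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
anchors = ["a", "d"]

def bfs_hops(start):
    """Hop distance from start to every reachable node."""
    hops, frontier = {start: 0}, deque([start])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                frontier.append(v)
    return hops

def tokenize(node, k=2):
    """The k nearest anchors as (anchor, distance) pairs for a node."""
    hops = bfs_hops(node)
    reachable = ((a, hops[a]) for a in anchors if a in hops)
    return sorted(reachable, key=lambda x: x[1])[:k]

assert tokenize("b") == [("a", 1), ("d", 2)]
```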
An example WikiKG link prediction pre-training run should pass the `--vocab` param pointing to the mined vocab, e.g.,

```bash
python script/run.py -c config/lp_pretraining/wikikg_nodepiece_nognn.yaml --gpus [0] --vocab /path/to/pickle/vocab.pkl
```
We highly recommend training both the no-GNN and GNN versions of NodePiece on WikiKG using several GPUs, for example:

```bash
python -m torch.distributed.launch --nproc_per_node=2 script/run.py -c config/lp_pretraining/wikikg_nodepiece_nognn.yaml --gpus [0,1] --vocab /path/to/pickle/vocab.pkl
```
Step 2: CQD Inference
Use the pre-trained link predictor to run CQD inference over EPFO queries (negation is not supported in this version of CQD).
Example of running CQD on the pre-trained 200d NodePiece w/ GNN model over the 175% dataset:

- Note that we need to specify a 2x smaller embedding dimension than the trained model's, as by default we train a ComplEx model with two parts, real and imaginary;
- Use the full path to the embeddings of the pre-trained models; they are named something like `/path/epoch_1000_ents` and `/path/epoch_1000_rels`, so just use the common prefix `/path/epoch_1000`.
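To see why the dimension is halved: a 200d ComplEx embedding stores 100 real and 100 imaginary components. A toy sketch with d=2 (made-up numbers; the actual checkpoint layout may differ):

```python
d = 2                                # complex dimension (the -d value for CQD)
emb = [0.5, -1.0, 2.0, 0.3]          # stored vector: [real parts | imaginary parts]
real, imag = emb[:d], emb[d:]
z = [complex(a, b) for a, b in zip(real, imag)]  # d complex coordinates

def complex_score(h, r, t):
    """ComplEx triple score: Re(sum_i h_i * r_i * conj(t_i))."""
    return sum((hi * ri * ti.conjugate()).real for hi, ri, ti in zip(h, r, t))

assert len(z) == d  # a 2*d real vector yields d complex coordinates
```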
```bash
python cqd/main.py --cuda --do_test --data_path ./data/175 -d 100 -cpu 6 --log_steps 10000 --test_log_steps 10000 --geo cqd --print_on_screen --cqd-k 32 --cqd-sigmoid --tasks "1p.2p.3p.2i.3i.ip.pi.2u.up" --inductive --checkpoint_path /path/epoch_1000 --skip_tr
```
To evaluate training queries on the bigger test graphs, use the argument `--eval_train`:

```bash
python cqd/main.py --cuda --do_test --data_path ./data/175 -d 100 -cpu 6 --log_steps 10000 --test_log_steps 10000 --geo cqd --print_on_screen --cqd-k 32 --cqd-sigmoid --tasks "1p.2p.3p.2i.3i.ip.pi.2u.up" --inductive --checkpoint_path /path/epoch_1000 --eval_train
```
To train GNN-QE and evaluate on the valid/test queries of a desired dataset ratio, use the `gnnqe_main.yaml` config. Example on the 175% dataset:

```bash
python script/run.py -c config/complex_query/gnnqe_main.yaml --ratio 175 --gpus [0]
```
Alternatively, you may specify `--gpus null` to run GNN-QE on a CPU.
The hyperparameters are designed for 32GB GPUs, but you may adjust the batch size in the config file to fit a smaller GPU memory.
To run GNN-QE with multiple GPUs or multiple machines, use the following commands:

```bash
python -m torch.distributed.launch --nproc_per_node=2 script/run.py -c config/complex_query/gnnqe_main.yaml --gpus [0,1]
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=4 script/run.py -c config/complex_query/gnnqe_main.yaml --gpus [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]
```
To evaluate training queries on the bigger test graphs, use the config `gnnqe_eval_train.yaml` and specify the checkpoint of the trained model via `--checkpoint`. The best-performing checkpoint is written in the log files after training with the main config. For example, if the best-performing 175% model is `model_epoch_1.pth`, then the command will be:

```bash
python script/run.py -c config/complex_query/gnnqe_eval_train.yaml --ratio 175 --gpus [0] --checkpoint /path/to/model/model_epoch_1.pth
```
Finally, we provide configs for the inference-only rule-based heuristic baseline that only considers possible tails of the last relation projection step of any query pattern. The two configs are `config/complex_query/heuristic_main.yaml` and `config/complex_query/heuristic_eval_train.yaml`.

To run the baseline on test queries (for example, on the 175% dataset):

```bash
python script/run.py -c config/complex_query/heuristic_main.yaml --ratio 175 --gpus [0]
```
To run the baseline on train queries over the bigger test graphs:

```bash
python script/run.py -c config/complex_query/heuristic_eval_train.yaml --ratio 175 --gpus [0]
```
If you find this project useful in your research, please cite the following paper:

```bibtex
@inproceedings{galkin2022inductive,
  title={Inductive Logical Query Answering in Knowledge Graphs},
  author={Mikhail Galkin and Zhaocheng Zhu and Hongyu Ren and Jian Tang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
  url={https://openreview.net/forum?id=-vXEN5rIABY}
}
```