/MLRC2020-EmbedKGQA

This is the code for the MLRC2020 challenge w.r.t. the ACL 2020 paper "Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings".

Primary language: Python. License: Apache License 2.0 (Apache-2.0).


EmbedKGQA: Reproduction and Extended Study

Additional Experiments

  • Knowledge Graph Embedding model
    • TuckER
    • Tested on {MetaQA_full, MetaQA_half} datasets
  • Question embedding models

Requirements

  • Python >= 3.7.5, pip
  • zip, unzip
  • Docker (Recommended)
  • PyTorch version 1.3.0a0+24ae9b5. For more info, visit here.

Helpful pointers

  • Docker Image: Cuda-Python[2] can be used. Use the runtime tag.

    • docker run -itd --rm --runtime=nvidia -v /raid/kgdnn/:/raid/kgdnn/ --name embedkgqa__4567 -e NVIDIA_VISIBLE_DEVICES=4,5,6,7  -p 7777:7777 qts8n/cuda-python:runtime
  • Alternatively, Docker Image: Embed_KGQA[3] can be used as well. It is built upon [2] and contains all the packages needed to conduct the experiments.

    • Use env tag for image without models.
    • Use env-models tag for image with models.
    • docker run -itd --rm --runtime=nvidia -v /raid/kgdnn/:/raid/kgdnn/ --name embedkgqa__4567 -e NVIDIA_VISIBLE_DEVICES=4,5,6,7  -p 7777:7777 jishnup/embed_kgqa:env
    • All the required packages and models (from the extended study with better performance) are readily available in [3].
      • Model location within the docker container: /raid/mlrc2020models/
        • /raid/mlrc2020models/embeddings/ contains the KG embedding models.
        • /raid/mlrc2020models/qa_models/ contains the QA models.
  • The experiments have been done using [2]. The package versions in requirements.txt have been set accordingly and may differ from those in [1].

  • KGQA/LSTM and KGQA/RoBERTa directory nomenclature hasn't been changed to avoid unnecessary confusion w.r.t. the original codebase[1].

  • fbwq_full and fbwq_full_new are the same dataset, but both must exist separately because:

    • Pretrained ComplEx model uses fbwq_full_new as the dataset name
    • Trained SimplE model uses fbwq_full as the dataset name
  • No fbwq_full_new dataset was found in the data shared by the author[1], so we went ahead with this setting.

  • The pretrained qa_models were also absent from the shared data, so the reproduction results are based on our own training scheme.

  • For training QA datasets, use batch_size >= 2.
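
The fbwq_full / fbwq_full_new pointer above says the two datasets are the same. A minimal sketch for confirming that on disk, assuming each dataset lives in its own directory (the directory paths are not given here, so they are passed in as arguments; `same_tree` is a hypothetical helper, not part of this codebase):

```python
import filecmp
import os


def same_tree(dir_a, dir_b):
    """Return True if both directories hold the same files with identical bytes."""
    cmp = filecmp.dircmp(dir_a, dir_b)
    # Names present on one side only, or names whose types differ, mean a mismatch.
    if cmp.left_only or cmp.right_only or cmp.common_funny:
        return False
    # dircmp compares files shallowly (stat signatures), so re-check the contents.
    _, mismatch, errors = filecmp.cmpfiles(
        dir_a, dir_b, cmp.common_files, shallow=False
    )
    if mismatch or errors:
        return False
    # Recurse into subdirectories that exist on both sides.
    return all(
        same_tree(os.path.join(dir_a, sub), os.path.join(dir_b, sub))
        for sub in cmp.common_dirs
    )
```

Call it with the two dataset directories, for example `same_tree(path_to_fbwq_full, path_to_fbwq_full_new)`, where both paths are placeholders for wherever the datasets were unpacked.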

Get started

# Clone the repo
git clone https://github.com/jishnujayakumar/MLRC2020-EmbedKGQA && cd MLRC2020-EmbedKGQA

# Set a new env variable called EMBED_KGQA_DIR with the absolute path of the MLRC2020-EmbedKGQA/ directory as its value
# If using the bash shell, run (double quotes make $(pwd) expand now, not on every login)
echo "export EMBED_KGQA_DIR=\"$(pwd)\"" >> ~/.bash_profile && source ~/.bash_profile

# Change script permissions
chmod -R 700 scripts/

# Initial setup
./scripts/initial_setup.sh

# Download and unzip the data and pretrained_models from the original EmbedKGQA paper
./scripts/download_artifacts.sh

# Install LibKGE
./scripts/install_libkge.sh
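
Since the later commands depend on EMBED_KGQA_DIR, it can help to verify the variable before training. The following is a minimal sketch, not part of the repository: `resolve_repo_dir` is a hypothetical helper, and the only layout it assumes is the scripts/ directory used in the steps above.

```python
import os
from pathlib import Path


def resolve_repo_dir(env=None):
    """Return the repo root named by EMBED_KGQA_DIR, or raise RuntimeError."""
    env = os.environ if env is None else env
    value = env.get("EMBED_KGQA_DIR")
    if not value:
        raise RuntimeError("EMBED_KGQA_DIR is not set; export it first.")
    repo_dir = Path(value).expanduser()
    if not repo_dir.is_absolute():
        raise RuntimeError(f"EMBED_KGQA_DIR must be an absolute path, got: {value}")
    # The setup steps run ./scripts/*.sh, so the directory must exist.
    if not (repo_dir / "scripts").is_dir():
        raise RuntimeError(f"No scripts/ directory under {repo_dir}")
    return repo_dir


if __name__ == "__main__":
    try:
        print(f"Using repo at {resolve_repo_dir()}")
    except RuntimeError as err:
        print(f"Check failed: {err}")
```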

Train KG Embeddings

  • Steps to train KG embeddings.

Train QA Datasets

Hyperparameters in the following commands are set w.r.t. [1].

MetaQA

# Method: 1
cd $EMBED_KGQA_DIR/KGQA/LSTM;
# --gpu: GPU ID, --hops: number of hops, --model: KGE model, --use_cuda: enable CUDA
# Replace each <...> placeholder with one of the listed values
python main.py  --mode train \
            --nb_epochs 100 \
            --relation_dim 200 \
            --hidden_dim 256 \
            --gpu 0 \
            --freeze 0 \
            --batch_size 64 \
            --validate_every 4 \
            --hops <1/2/3> \
            --lr 0.0005 \
            --entdrop 0.1 \
            --reldrop 0.2 \
            --scoredrop 0.2 \
            --decay 1.0 \
            --model <ComplEx/TuckER> \
            --patience 10 \
            --ls 0.0 \
            --use_cuda True \
            --kg_type <half/full>

        
# Method: 2
# Modify the hyperparameters in the script file w.r.t. your usecase
$EMBED_KGQA_DIR/scripts/train_metaQA.sh \
    <ComplEx/TuckER> \
    <half/full> \
    <1/2/3> \
    <batch_size> \
    <gpu_id> \
    <relation_dim>
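
The Method 1 flags above can also be assembled programmatically, which makes it easier to sweep over hops or KG types. This is a minimal sketch under stated assumptions: `metaqa_train_args` is a hypothetical helper, not part of the repository. The flag names and default values are copied from the Method 1 command, and the allowed values come from its placeholders.

```python
import shlex

ALLOWED_MODELS = ("ComplEx", "TuckER")
ALLOWED_KG_TYPES = ("half", "full")
ALLOWED_HOPS = (1, 2, 3)


def metaqa_train_args(model, kg_type, hops, batch_size=64, gpu=0, relation_dim=200):
    """Build the argument list for the MetaQA training command (Method 1)."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}, got {model!r}")
    if kg_type not in ALLOWED_KG_TYPES:
        raise ValueError(f"kg_type must be one of {ALLOWED_KG_TYPES}, got {kg_type!r}")
    if hops not in ALLOWED_HOPS:
        raise ValueError(f"hops must be one of {ALLOWED_HOPS}, got {hops!r}")
    if batch_size < 2:
        raise ValueError("batch_size must be >= 2 for training")
    flags = {
        "--mode": "train",
        "--nb_epochs": 100,
        "--relation_dim": relation_dim,
        "--hidden_dim": 256,
        "--gpu": gpu,
        "--freeze": 0,
        "--batch_size": batch_size,
        "--validate_every": 4,
        "--hops": hops,
        "--lr": 0.0005,
        "--entdrop": 0.1,
        "--reldrop": 0.2,
        "--scoredrop": 0.2,
        "--decay": 1.0,
        "--model": model,
        "--patience": 10,
        "--ls": 0.0,
        "--use_cuda": True,
        "--kg_type": kg_type,
    }
    args = ["python", "main.py"]
    for flag, value in flags.items():
        args += [flag, str(value)]
    return args


if __name__ == "__main__":
    # Print a shell-safe command line; run it from $EMBED_KGQA_DIR/KGQA/LSTM.
    command = metaqa_train_args("ComplEx", "half", 1)
    print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can be passed to `subprocess.run(args, cwd=...)` with the working directory set to KGQA/LSTM, which avoids shell quoting altogether.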

WebQuestionsSP

# Method: 1
cd $EMBED_KGQA_DIR/KGQA/RoBERTa;
python main.py  --mode train \
                --relation_dim 200 \
                --que_embedding_model RoBERTa \
                --do_batch_norm 0 \
                --gpu 0 \
                --freeze 1 \
                --batch_size 16 \
                --validate_every 10 \
                --hops webqsp_half \
                --lr 0.00002 \
                --entdrop 0.0 \
                --reldrop 0.0 \
                --scoredrop 0.0 \
                --decay 1.0 \
                --model ComplEx \
                --patience 20 \
                --ls 0.0 \
                --l3_reg 0.001 \
                --nb_epochs 200 \
                --outfile delete

# Method: 2
# Modify the hyperparameters in the script file w.r.t. your usecase
$EMBED_KGQA_DIR/scripts/train_webqsp.sh \
    <ComplEx/SimplE> \
    <RoBERTa/ALBERT/XLNet/Longformer/SentenceTransformer> \
    <half/full> \
    <batch_size> \
    <gpu_id> \
    <relation_dim>

Test QA Datasets

Set the mode parameter to test and keep the other hyperparameters the same as those used in training.
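
One way to keep the test run aligned with training is to derive the test arguments from the training arguments and change only --mode. The sketch below assumes the command is held as a Python list; `with_mode` is a hypothetical helper, not part of the repository.

```python
def with_mode(args, mode):
    """Return a copy of args with the value that follows --mode replaced."""
    out = list(args)
    idx = out.index("--mode")  # raises ValueError if --mode is absent
    if idx + 1 >= len(out):
        raise ValueError("--mode has no value")
    out[idx + 1] = mode
    return out


if __name__ == "__main__":
    # Shortened argument list for illustration; use the full training list in practice.
    train_args = ["python", "main.py", "--mode", "train", "--hops", "1", "--model", "ComplEx"]
    print(with_mode(train_args, "test"))
```

Because the function copies the list, the original training arguments stay unchanged and every other hyperparameter carries over as-is.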

Helpful links

Citation:

Please cite the following if you incorporate our work.

@article{P:2021,
  author = {P, Jishnu Jaykumar and Sardana, Ashish},
  title = {{[Re] Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings}},
  journal = {ReScience C},
  year = {2021},
  month = may,
  volume = {7},
  number = {2},
  pages = {{#15}},
  doi = {10.5281/zenodo.4834942},
  url = {https://zenodo.org/record/4834942/files/article.pdf},
  code_url = {https://github.com/jishnujayakumar/MLRC2020-EmbedKGQA},
  code_doi = {},
  code_swh = {swh:1:dir:c95bc4fec7023c258c7190975279b5baf6ef6725},
  data_url = {},
  data_doi = {},
  review_url = {https://openreview.net/forum?id=VFAwCMdWY7},
  type = {Replication},
  language = {Python},
  domain = {ML Reproducibility Challenge 2020},
  keywords = {knowledge graph, embeddings, multi-hop, question-answering, deep learning}
}

The following 3 options are available for any clarifications, comments, or suggestions