
CAQA

PyTorch implementation of the EMNLP 2021 paper Contrastive Domain Adaptation for Question Answering using Limited Text Corpora

We propose a novel framework for domain adaptation called contrastive domain adaptation for QA (CAQA). CAQA combines techniques from question generation and domain-invariant learning to answer out-of-domain questions in settings with limited text corpora. Specifically, we train a QA system on both source data and data generated from the target domain, with a contrastive adaptation loss incorporated into the training objective. As a result, our model achieves considerable improvements over state-of-the-art baselines.
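
The exact formulation of this loss is given in the paper and implemented in src/bert_squad_trans.py. As a rough intuition only, below is a minimal PyTorch sketch, assuming MMD-style discrepancies over pooled answer and context token features; the kernel choice, signs, and pooling are simplifications for illustration, not the repo's exact code:

import torch

def gaussian_mmd(x, y, sigma=1.0):
    # Squared maximum mean discrepancy (MMD) with a Gaussian kernel,
    # computed between two sets of feature vectors of shape (n, d) and (m, d).
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2)).mean()
    return k(x, x) + k(y, y) - 2 * k(x, y)

def contrastive_adaptation_loss(src_ans, tgt_ans, ctx, sigma=1.0):
    # Pull source and target answer features toward each other ...
    pull = gaussian_mmd(src_ans, tgt_ans, sigma)
    # ... while pushing answer features away from context features.
    push = gaussian_mmd(torch.cat([src_ans, tgt_ans]), ctx, sigma)
    return pull - push

# The full objective would combine this term with the standard QA loss,
# e.g. total_loss = qa_loss + beta * adaptation_loss (beta as a weight).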

Citing

Please cite the following paper if you use our methods in your research:

@inproceedings{yue2021contrastive,
  title={Contrastive Domain Adaptation for Question Answering using Limited Text Corpora},
  author={Yue, Zhenrui and Kratzwald, Bernhard and Feuerriegel, Stefan},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2021}
}

Data & Requirements

Download the MRQA datasets with the following command:

bash download_data.sh

To run our code, you need PyTorch and Transformers; see requirements.txt for our full running environment.
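
For example, assuming a suitable Python environment, the dependencies can be installed via pip:

pip install -r requirements.txt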

Train QAGen-T5 Generation Model

Our QAGen-T5 model consists of a question generation model and an answer generation model. We train both models with question_generation, and we provide all generated synthetic data here.
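
For illustration, below is a minimal sketch of this two-stage idea in plain Transformers; the checkpoints and prompt formats are placeholders rather than the repo's exact setup, and the actual pipeline lives in question_generation:

from transformers import T5ForConditionalGeneration, T5TokenizerFast

def generate(model, tokenizer, prompt):
    # Greedy decoding of a single output sequence for the given prompt.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Placeholder checkpoints: in practice these would be two T5 models
# fine-tuned for answer generation and question generation, respectively.
tokenizer = T5TokenizerFast.from_pretrained("t5-large")
ans_model = T5ForConditionalGeneration.from_pretrained("t5-large")
qg_model = ans_model  # stand-in; normally a separately fine-tuned model

context = "CAQA adapts question answering models to new target domains."
answer = generate(ans_model, tokenizer, "extract answer: " + context)
question = generate(qg_model, tokenizer,
                    "generate question: answer: " + answer + " context: " + context)
print({"context": context, "answer": answer, "question": question})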

Train BERT-QA with Contrastive Adaptation Loss

With the generated synthetic data, the BERT-QA model can be trained on both SQuAD and the synthetic data. Run the following command to train a model with the proposed contrastive adaptation loss:

CUDA_VISIBLE_DEVICES=0 python src/bert_squad_trans.py \
    --do_lower_case \
    --adaptation_method=smooth_can \
    --do_train \
    --train_both \
    --beta=0.001 \
    --sigma=0.001 \
    --mrqa_train_file=./data/synthetic/qagen_t5large_LM_5_10000_HotpotQA.jsonl \
    --do_predict \
    --predict_squad
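
Here, --train_both trains on both SQuAD and the synthetic target-domain data passed via --mrqa_train_file; --beta and --sigma presumably set the weight and kernel bandwidth of the contrastive adaptation term (see src/bert_squad_trans.py for the exact semantics).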

Performance

For detailed results across the MRQA target domains, please refer to our paper.

Acknowledgement

Our implementation is largely based on Transformers from Hugging Face, question_generation by Suraj Patil, and MRQA-Shared-Task-2019 by Fisch et al. Many thanks to these authors for their great work!