KPQA

This repository provides an evaluation metric for generative question answering systems, based on our NAACL 2021 paper KPQA: A Metric for Generative Question Answering Using Keyphrase Weights.
Here we provide the code to train KPQA, a pretrained model, human-annotated correctness judgments, and the code to compute the KPQA metric.
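As a rough illustration of the underlying idea, KPQA assigns each token an importance weight and uses those weights to rescale token-overlap metrics such as BLEU-1 and ROUGE-L, so that keyphrases dominate the score. The sketch below is illustrative only and is not the repository's implementation: the function name is hypothetical and the weights are hard-coded, whereas in practice they come from the trained KPQA model.

from collections import Counter

def weighted_unigram_f1(pred_tokens, ref_tokens, pred_weights, ref_weights):
    # Weighted precision: importance mass of predicted tokens that
    # also appear in the reference (clipped matching, as in BLEU-1).
    ref_budget = Counter(ref_tokens)
    matched = 0.0
    for tok, w in zip(pred_tokens, pred_weights):
        if ref_budget[tok] > 0:
            ref_budget[tok] -= 1
            matched += w
    precision = matched / max(sum(pred_weights), 1e-8)

    # Weighted recall: importance mass of reference tokens that
    # also appear in the prediction.
    pred_budget = Counter(pred_tokens)
    matched = 0.0
    for tok, w in zip(ref_tokens, ref_weights):
        if pred_budget[tok] > 0:
            pred_budget[tok] -= 1
            matched += w
    recall = matched / max(sum(ref_weights), 1e-8)

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage with hand-picked weights (a real KPQA model would predict these):
pred = "the answer is paris".split()
ref = "paris is the capital".split()
print(weighted_unigram_f1(pred, ref, [0.1, 0.1, 0.1, 1.0], [1.0, 0.1, 0.1, 0.8]))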

This repository will be updated by 6/10 with a more convenient demo in a Jupyter notebook (the model weights will be uploaded to Hugging Face Models).

Dataset

We provide human judgments of correctness for four datasets: MS-MARCO NLG, AVSD, NarrativeQA, and SemEval 2018 Task 11 (SemEval).
For MS-MARCO NLG and AVSD, we generate answers using two models for each dataset. For NarrativeQA and SemEval, we preprocess the data from [Evaluating Question Answering Evaluation](https://www.aclweb.org/anthology/D19-5817).

Usage

1. Install Prerequisites

Install the required packages listed in "requirements.txt".
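For example, with pip:

pip install -r requirements.txt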

2. Download Pretrained Model

We provide the pre-trained KPQA model at the following link:
https://drive.google.com/file/d/1pHQuPhf-LBFTBRabjIeTpKy3KGlMtyzT/view?usp=sharing
Download "ckpt.zip" and extract it.

3. Compute Metric

You can compute the KPQA metric using "compute_correlation.py":

python compute_correlation.py \
  --dataset marco \
  --qa_model unilm \
  --model_dir $CHECKPOINT_DIR

--dataset: the target dataset to evaluate the metric on (e.g., marco)
--qa_model: the model used to generate the answers (e.g., unilm)
--model_dir: the path to the checkpoint directory (where "ckpt.zip" was extracted)
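For example, assuming "ckpt.zip" was extracted to ./ckpt:

python compute_correlation.py --dataset marco --qa_model unilm --model_dir ./ckpt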

When evaluating various metrics on the MS-MARCO NLG dataset, the printed results (correlations with human judgments) should look like the following.

Metric | Pearson | Spearman
--- | --- | ---
BLEU-1 | 0.369 | 0.337
BLEU-4 | 0.173 | 0.224
ROUGE-L | 0.317 | 0.289
CIDEr | 0.261 | 0.256
BERTScore | 0.469 | 0.445
BLEU-1-KPQA | 0.729 | 0.676
ROUGE-L-KPQA | 0.735 | 0.674
BERTScore-KPQA | 0.698 | 0.660

Train KPQA (optional)

You can train your own KPQA model on the provided datasets, or on your own data, using "train.py".
To train with the default settings, run "train_kpqa.sh", as shown below.
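For example, from the repository root:

bash train_kpqa.sh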

Reference

If you find this repo useful, please consider citing:

@inproceedings{lee2021kpqa,
  title={KPQA: A Metric for Generative Question Answering Using Keyphrase Weights},
  author={Lee, Hwanhee and Yoon, Seunghyun and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Shin, Joongbo and Jung, Kyomin},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={2105--2115},
  year={2021}
}