This repo contains the code for the paper Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, which appeared at ACL 2020.
To compute QAGS scores, we need to
- generate questions
- answer questions
- compare answers
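At a high level, the three steps above fit together as follows. This is an illustrative sketch only, not the repo's actual API: the function names, signatures, and the averaging of per-question answer-similarity scores are assumptions for exposition.

```python
def qags_score(source, summary, qg_model, qa_model, answer_sim):
    """Illustrative QAGS pipeline: generate questions from the summary,
    answer them against both texts, and average answer agreement.
    All callables here are hypothetical stand-ins for the repo's models."""
    questions = qg_model(summary)                 # step 1: generate questions
    scores = []
    for q in questions:
        ans_src = qa_model(q, context=source)     # step 2: answer on the source
        ans_sum = qa_model(q, context=summary)    # step 2: answer on the summary
        scores.append(answer_sim(ans_src, ans_sum))  # step 3: compare answers
    return sum(scores) / len(scores) if scores else 0.0
```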
We use an answer-conditional question generation model, so we first need to extract answer candidates.
Use the following command, where `data_file` is a text file containing one example per line and `out_dir` is the directory to write the processed files to. The script will produce `test.txt`, `test_{n_ans_per_txt}.txt`, and `test_w_{n_ans_per_txt}ans.txt` in `out_dir`, which respectively contain the examples, the extracted answers, and the answers and examples formatted to feed into the QG model.
```
python qg_utils.py --command extract_ans \
    --data_file ${data_file} \
    --out_dir ${out_dir}
```
To generate the questions, we rely on BART finetuned on NewsQA, implemented in `fairseq`. A frozen version of `fairseq` for doing so is available in `qags/fairseq`. Our pretrained QG model is available here.
To generate from these models, we must first preprocess the data (tokenize and binarize) using the command `./fairseq/scripts/aw/preprocess.sh preprocess`. In the script, make sure to change `dat_dir` to point to the directory containing your files. The script expects `dat_dir` to contain `test.src` and `test.trg`, where `test.src` is the file that will actually be fed into the QG model to generate from; `test.trg` can be a dummy file with the same number of lines (e.g., a copy of `test.src`).
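Since `test.trg` is never read during generation, creating the dummy file is just a copy. A minimal sketch (paths are placeholders for your own `dat_dir`):

```python
import shutil

def make_dummy_trg(src_path, trg_path):
    """test.trg only needs the same number of lines as test.src,
    so a verbatim copy is sufficient."""
    shutil.copyfile(src_path, trg_path)
```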
Then to generate, use the command `./scripts/gen_qg.sh`. Change `model_path` to point to the pretrained QG checkpoint, `data_path` to the directory containing the processed data (typically the `processed` directory created during preprocessing), and `out_file` to the file to log to.
Due to a code quirk, in `fairseq/fairseq/models/summerization_encoder_only.py`, set `HACK_PATH` (line 107) to the `best_pretrained_bert.pt` checkpoint, located here.
Finally, extract the generated questions using

```
python qg_utils.py --command extract-gen \
    --data_file ${fseq_log_file} \
    --out_dir ${out_dir}
```

which will extract the generations and the corresponding probabilities to `gen.txt` and `prob.txt`, respectively, in `out_dir`.
To prepare the QA data, use the following command:
```
python qa_utils.py --command format-qa-data --out_dir tmp \
    --src_txt_file ${src_txt_file} --gen_txt_file ${gen_txt_file} \
    --gen_qst_file ${gen_qst_file} --gen_prob_file ${gen_prob_file}
```
where `gen_{qst/prob}_file` are generated by the previous step (`gen.txt` and `prob.txt`), and `{src/gen}_txt_file` are respectively the source and model-generated texts (e.g., for summarization, the source articles and the model-generated summaries to be evaluated).
As part of this step, we filter questions by quality using a number of heuristics.
Most importantly, we filter questions by enforcing answer consistency:
We use a QA model to answer the generated questions, and if the predicted answer doesn't match the original answer, we throw out the question.
To do this, we need to run the QA model on the generated questions, which will produce an answer file. For this step, use the flag `--use_all_qsts`, and then run the QA model on the resulting data file. Once you have answers for each question, we need to compare the expected and predicted answers, which we do via the flags `--use_exp_anss --gen_ans_file ${gen_ans_file} --gen_prd_file ${gen_prd_file}`, where the latter two files respectively contain the expected and the predicted answers.
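The answer-consistency check can be sketched as follows. This is a minimal illustration using SQuAD-style token-level F1; the repo's actual filtering lives in `qa_utils.py` and applies additional heuristics, and the 0.9 threshold here is an assumption, not the repo's setting.

```python
import collections

def token_f1(pred, gold):
    """SQuAD-style token-level F1 between two answer strings."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)
    n_same = sum(common.values())
    if n_same == 0:
        return 0.0
    precision = n_same / len(pred_toks)
    recall = n_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def filter_questions(questions, expected, predicted, threshold=0.9):
    """Keep only questions whose QA-predicted answer matches the
    answer candidate the question was generated from."""
    return [q for q, exp, prd in zip(questions, expected, predicted)
            if token_f1(prd, exp) >= threshold]
```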
To evaluate our QA models, use the following command to evaluate the model on `pred_file` and write the predictions to `out_dir/out_file`. Our models are based on `pytorch-pretrained-BERT` (now `transformers`), and pretrained checkpoints are located here. Make sure `model_dir` points to the QA model directory. To compute QAGS scores, evaluate the QA model using both the article as context and the summary as context, so you will need to run this command twice.
```
python finetune_pt_squad.py \
  --bert_model bert-large-uncased \
  --load_model_from_dir ${model_dir} \
  --version_2_with_negative \
  --do_lower_case \
  --do_predict \
  --predict_file ${pred_file} \
  --output_dir ${out_dir} \
  --prediction_file ${out_file} \
  --overwrite_output_dir
```
Finally, to get the actual QAGS scores, we compare answers.
The following command will write the scores to `out_dir/qags_scores.txt`:

```
python qa_utils.py --command compute-qags \
    --src-ans-file ${src_ans_file} \
    --trg-ans-file ${trg_ans_file} \
    --out-dir ${out_dir}
```
The crowdsourced annotations of summary sentences we collected are available in `data/mturk_{cnndm,xsum}.jsonl`. Each line contains an article, a model-generated summary divided into sentences, and three annotations per sentence. Each annotation is a binary judgment of whether or not the summary sentence is factually supported by the article, along with an anonymized annotator ID.
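One might load these annotations and reduce the three judgments per sentence to a majority vote roughly as follows. The field names below are assumptions, not a documented schema; inspect the JSONL files for the actual keys.

```python
import collections
import json

def majority_votes(jsonl_path):
    """For each example, reduce each sentence's three binary annotations
    to a majority vote. Keys like 'summary_sentences' and 'responses'
    are assumed for illustration, not taken from the repo."""
    results = []
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)
            votes = []
            for sent in ex["summary_sentences"]:          # assumed key
                labels = [r["response"] for r in sent["responses"]]  # assumed keys
                votes.append(collections.Counter(labels).most_common(1)[0][0])
            results.append(votes)
    return results
```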
For CNNDM, the summarization model is Bottom-Up Summarization (Gehrmann et al., 2018). For XSUM, the summarization model is BART finetuned on the XSUM training data.
If you use this code or data, please cite us.
```
@article{wang2020asking,
  title={Asking and Answering Questions to Evaluate the Factual Consistency of Summaries},
  url={http://dx.doi.org/10.18653/v1/2020.acl-main.450},
  DOI={10.18653/v1/2020.acl-main.450},
  journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  publisher={Association for Computational Linguistics},
  author={Wang, Alex and Cho, Kyunghyun and Lewis, Mike},
  year={2020}
}
```