Replication Project - Contrast Candidate Generation and Selection

The goal of this project is to reproduce Table 3 of Chen et al.'s paper, which introduces a model for correcting hallucinations in generated summaries.

[Screenshot: Table 3 from the original paper]

Introduction

The authors present a method for correcting hallucinations in generated text summaries. They leverage a pre-trained BART-Base model that is fine-tuned to discriminate between "faithful" and "unfaithful" summaries. The training data for fine-tuning is created by artificially corrupting ground-truth (human-written) summaries.

At a high-level, the process can be broken down into two steps:

  1. Candidate generation: candidate summaries are created by replacing entities and quantities in a summary with entities of compatible semantic types from the source document (see the sketch after this list).
    • At training time, entities are replaced in the ground truth summary to create negative examples
    • At inference time, entities are replaced in the generated summary to create candidate summaries
  2. Candidate selection: a fine-tuned BART “faithfulness” classifier ranks the candidate summaries according to how faithful they are to the source document. Within the set of candidate summaries, the most faithful summary as predicted by this classifier is the final output of the system.
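To make step 1 concrete, here is a minimal sketch of candidate generation. This is not the authors' code: it assumes spaCy NER as the entity tagger and treats two entities as compatible whenever they share an NER label, which is a simplification of the semantic-type matching used in the paper.

import spacy

# Illustrative sketch of candidate generation (not the authors' exact code).
# Each entity in the summary is swapped with source-document entities of the
# same NER type to produce corrupted (training) or candidate (inference) summaries.
nlp = spacy.load("en_core_web_sm")

def generate_candidates(source: str, summary: str) -> list[str]:
    source_ents = nlp(source).ents
    summary_doc = nlp(summary)
    candidates = []
    for ent in summary_doc.ents:
        # Source entities of a compatible type that differ from the original.
        replacements = {
            e.text for e in source_ents
            if e.label_ == ent.label_ and e.text != ent.text
        }
        for rep in replacements:
            candidates.append(
                summary[:ent.start_char] + rep + summary[ent.end_char:]
            )
    return candidates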

[Screenshot: overview of the two-step correction pipeline]

Methods

In order to replicate Table 3, we set out to:

  1. Generate summaries using the baseline model (BART-Large fine-tuned on XSUM)
  2. Implement and fine-tune the BART-Base faithfulness classifier on positive examples and artificially generated negative examples from XSUM train
  3. Run evaluation metrics (ROUGE, BERTScore, and FEQA) for the baseline and the correction model on XSUM test

Baseline

We begin by generating summaries for the XSum test set using a pre-trained BART-Large model that has been fine-tuned on XSum.
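A minimal sketch of this step, assuming the publicly available facebook/bart-large-xsum checkpoint on the Hugging Face hub (the generation hyperparameters below are illustrative, not necessarily the ones we used):

from transformers import BartForConditionalGeneration, BartTokenizer

# Load the XSum-fine-tuned BART-Large baseline.
model_name = "facebook/bart-large-xsum"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

def summarize(document: str) -> str:
    inputs = tokenizer(document, max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(
        inputs["input_ids"], num_beams=6, max_length=60, early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)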

Data Generation

We make use of the code published by the paper's authors to run the candidate generation process, and write our own scripts for tokenizing and batching candidate summaries (bart_tokenize.py, prepare_train_dataset.py).
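As an illustration of what the pairing step produces, here is a rough sketch of building one (positive, negative) contrastive pair per example; the jsonl field names and the one-pair-per-example layout are assumptions, not necessarily what prepare_train_dataset.py actually does.

import json
import random

# Hypothetical sketch: pair each gold summary with one corrupted candidate.
def build_pairs(in_path: str, out_path: str) -> None:
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            example = json.loads(line)
            negatives = example.get("corrupted_summaries", [])
            if not negatives:
                continue
            pair = {
                "document": example["document"],
                "positive": example["gold_summary"],   # label 1 (faithful)
                "negative": random.choice(negatives),  # label 0 (unfaithful)
            }
            f_out.write(json.dumps(pair) + "\n")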

Correction Model

We implement the correction model with a combined cross-entropy and contrastive max-margin loss from scratch (code).

[Screenshot: combined cross-entropy and contrastive max-margin loss]

The model is fine-tuned on Colab using training data that was published by the authors.
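For reference, here is a compact sketch of how such a combined loss can be written on top of a BART-Base sequence classifier. The margin value and the way faithfulness scores are read off the classifier are our assumptions here, not necessarily the authors' exact formulation.

import torch
import torch.nn.functional as F
from transformers import BartForSequenceClassification

# Faithfulness classifier: label 1 = faithful, label 0 = unfaithful.
model = BartForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)

def combined_loss(pos_inputs, neg_inputs, margin: float = 1.0):
    # Logits for the faithful (ground-truth) and corrupted summary, each
    # encoded together with the source document by the tokenizer.
    pos_logits = model(**pos_inputs).logits  # shape (1, 2)
    neg_logits = model(**neg_inputs).logits

    # Cross-entropy on both members of the contrastive pair.
    ce = F.cross_entropy(pos_logits, torch.tensor([1])) + \
         F.cross_entropy(neg_logits, torch.tensor([0]))

    # Max-margin term: the faithful score of the positive summary should
    # exceed that of the negative summary by at least `margin`.
    pos_score = torch.softmax(pos_logits, dim=-1)[0, 1]
    neg_score = torch.softmax(neg_logits, dim=-1)[0, 1]
    margin_loss = torch.clamp(margin - (pos_score - neg_score), min=0)

    return ce + margin_loss

At inference time, the same classifier's score for the "faithful" class is used to rank the candidate summaries, and the highest-scoring candidate becomes the corrected output.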

Evaluation

We compute BERTScore, ROUGE-L, and FEQA scores for two sets of summaries:

  • baseline generated summaries
  • corrected baseline generated summaries
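The ROUGE-L and BERTScore parts of the evaluation can be computed with the standard rouge_score and bert_score packages; a minimal sketch is shown below (FEQA needs its own question generation and question answering models, see the Appendix).

from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Sketch: corpus-level ROUGE-L F1 and BERTScore F1 for generated summaries.
def evaluate(predictions: list[str], references: list[str]) -> dict:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ) / len(predictions)

    # bert_score returns per-example precision, recall, and F1 tensors.
    _, _, f1 = bert_score(predictions, references, lang="en")
    return {"rougeL": rouge_l, "bert_score_f1": f1.mean().item()}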

Code:

Results

Original Table

[Screenshot: Table 3 from the original paper]

Our Replication

[Screenshot: our replicated results table]

We are able to replicate the evaluation trends from the paper. Most notably, our trained correction model improves the faithfulness of the baseline generated summaries, as measured by the FEQA scores.

There are some subtle differences which we attribute to:

  • Transformer fine-tuning instability
  • Variance in the loss due to the small batch size (a single contrastive pair per step) over shuffled training data

Training logs

[Screenshots: training logs]

A key challenge in this replication was the computational cost: the full model requires 18 hours of GPU fine-tuning, and FEQA evaluation takes up to 2 hours for the corrected test set (on a GPU), which makes it difficult to evaluate the correction model during training. To overcome this, we saved model checkpoints and cached FEQA scores for a sample of the candidate summaries, which enabled us to evaluate our model iteratively during training.
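A rough sketch of the cache-and-checkpoint logic (the file layout, cache keys, and the feqa_fn callable are illustrative assumptions):

import hashlib
import json
import os

CACHE_PATH = "feqa_cache.json"

def load_cache() -> dict:
    # Reuse FEQA scores computed in earlier evaluation runs.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}

def cached_feqa(document: str, summary: str, feqa_fn, cache: dict) -> float:
    key = hashlib.sha1((document + "||" + summary).encode()).hexdigest()
    if key not in cache:
        cache[key] = feqa_fn(document, summary)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return cache[key]

def save_checkpoint(model, tokenizer, step: int) -> None:
    # Standard Hugging Face checkpointing so training can resume on Colab.
    out_dir = f"checkpoints/step_{step}"
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)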


Appendix

Data Preprocessing

For efficiency, we tokenize the datasets in advance of training.

$ python bart_tokenize.py data/paper/val.jsonl data/tokenized/val.tokenized.jsonl
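A simplified sketch of what such a tokenization script could look like (the jsonl field names and tokenizer settings are assumptions, not necessarily what bart_tokenize.py actually uses):

import json
import sys
from transformers import BartTokenizer

# Hypothetical sketch: pre-tokenize documents and summaries with the BART
# tokenizer and write the resulting token ids back out as jsonl.
def main(in_path: str, out_path: str) -> None:
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            example = json.loads(line)
            example["document_ids"] = tokenizer(
                example["document"], max_length=1024, truncation=True
            )["input_ids"]
            example["summary_ids"] = tokenizer(
                example["summary"], max_length=128, truncation=True
            )["input_ids"]
            f_out.write(json.dumps(example) + "\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])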

Running FEQA

See this notebook.

Trained models for the question generation and question answering systems are available under this drive.

  1. Download squad1.0 from Google Drive and place it under the evaluation/qa_models directory.
  2. Download the checkpoints folder and place it under the evaluation/bart_qg directory.
  3. Run python -m spacy download en_core_web_sm