Improving Coverage of Synthetically Generated QA Pairs

This project contains all the code for Sam Mayers' summer internship project, Improving Coverage of Synthetically Generated QA Pairs.

The code is based on the Hugging Face library.

Requirements and setup

  • Necessary packages can be found in the requirements.txt file.
  • Python version 3.9
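
To install the packages listed in requirements.txt, run:

pip install -r requirements.txt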

Data:

The datasets used for training are the DREAM dataset and the NarrativeQA dataset. This data is already preprocessed.

To reformat the original datasets yourself, use preprocess/format_dataset.py

To generate coverage scores for a formatted dataset, use get_coverage_scores.py

To normalize the coverage scores for a formatted dataset with coverage scores, use normalizecoverage.py
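
As a rough illustration of the normalization step (the exact behaviour lives in normalizecoverage.py; the field name and the min-max scheme below are assumptions, not the script's actual logic), per-example coverage scores could be rescaled to [0, 1] like this:

# Illustrative sketch only; normalizecoverage.py may use a different scheme.
# Assumes each formatted example carries a raw "coverage" score.
import json

def normalize_coverage(examples):
    scores = [ex["coverage"] for ex in examples]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against all scores being equal
    for ex in examples:
        ex["coverage"] = (ex["coverage"] - lo) / span
    return examples

with open("train_with_coverage.json") as f:        # hypothetical file name
    data = json.load(f)
with open("train_normalized.json", "w") as f:      # hypothetical file name
    json.dump(normalize_coverage(data), f, indent=2)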

Training QAGen:

The idea is to train the model to generate QA pairs given a dataset of dialogues or other texts. To train vanilla BART, run:

python train_coverageloss.py -train_path PATH_TO_TRAIN_DATA1 PATH_TO_TRAIN_DATA2 -val_path PATH_TO_VAL_DATA1 PATH_TO_VAL_DATA2 -save_dir PATH_WHERE_TO_SAVE_TRAINED_MODEL -log_path logs/training_log 

To add in a coverage loss, run:

python train_coverageloss.py -train_path PATH_TO_TRAIN_DATA1 PATH_TO_TRAIN_DATA2 -val_path PATH_TO_VAL_DATA1 PATH_TO_VAL_DATA2 -save_dir PATH_WHERE_TO_SAVE_TRAINED_MODEL -loss var -log_path logs/training_log 
python train_coverageloss.py -train_path PATH_TO_TRAIN_DATA1 PATH_TO_TRAIN_DATA2 -val_path PATH_TO_VAL_DATA1 PATH_TO_VAL_DATA2 -save_dir PATH_WHERE_TO_SAVE_TRAINED_MODEL -loss ent -log_path logs/training_log 

Other parameters can be changed/included, such as:

  • Batch size ( -bsz )
  • Gradient accumulation ( -grad_accum )
  • Epochs ( -epochs )
  • Stop counter ( -stop_counter ) (the number of epochs to continue training for while the validation loss is not improving)
  • Learning rate ( -lr )
  • Checkpoint ( -checkpoint ) (if you want to load a model and continue training from that checkpoint)
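
Conceptually, the -loss var and -loss ent options above add a coverage-based term to the standard token-level cross-entropy. The sketch below only illustrates that pattern; the actual variance and entropy terms, their signs, and their weighting are defined in train_coverageloss.py and may differ.

# Illustrative sketch only: the real -loss var / -loss ent terms live in
# train_coverageloss.py; the definitions and weighting here are assumptions.
import torch

def combined_loss(ce_loss, coverage_scores, loss_type=None, weight=1.0):
    # ce_loss: token-level cross-entropy from the BART forward pass
    # coverage_scores: per-example coverage values for the current batch
    if loss_type == "var":
        coverage_term = coverage_scores.var()
    elif loss_type == "ent":
        probs = torch.softmax(coverage_scores, dim=0)
        coverage_term = -(probs * probs.clamp_min(1e-12).log()).sum()
    else:
        return ce_loss  # vanilla BART training
    return ce_loss + weight * coverage_term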

Generating Synthetic Data

Once a QAGen model is trained, a synthetic dataset can be generated for a dataset of dialogues or texts.

To generate a synthetic dataset, run:

python generate.py -test_path PATH_TO_TEXTS_OR_DIALOGUES -model_path PATH_TO_TRAINED_MODEL -generate PATH_TO_OUTPUT_FILE

Other parameters can be changed/included to control the style of the generated outputs, such as:

  • Max length ( -max_length )
  • Top k ( -top_k )
  • Top p ( -top_p )
  • Number of generated QA pairs per text/dialogue ( -num_qa )

More information on these generation parameters can be found in the Hugging Face documentation; a sketch of how they map onto generate() settings is shown below.
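
Purely as an illustration (the checkpoint path, input formatting, and the mapping of -num_qa to num_return_sequences are assumptions, not taken from generate.py), the flags correspond roughly to these Hugging Face generate() arguments:

# Minimal sketch of the decoding flags as transformers generate() kwargs.
# The model path and prompt below are placeholders, not the project's.
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("path/to/trained_qagen")
tokenizer = BartTokenizer.from_pretrained("path/to/trained_qagen")

inputs = tokenizer("A: Hi, how was the trip? B: Long, but the views were great.",
                   return_tensors="pt", truncation=True)

outputs = model.generate(
    **inputs,
    do_sample=True,          # sampling, so top_k / top_p take effect
    max_length=64,           # -max_length
    top_k=50,                # -top_k
    top_p=0.95,              # -top_p
    num_return_sequences=5,  # -num_qa (assumed: one sequence per QA pair)
)
for qa in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(qa)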

Evaluation

Three different types of automatic metrics can be run once a model is trained and a synthetic dataset has been generated.

1. Macaw

Macaw is a question answering system and can be used to indicate the answerability of generated QA pairs. Each generated question in the synthetic dataset is paired with its corresponding text/dialogue and given to Macaw. Macaw's predicted answer is then compared to the synthetically generated answer using EM, F1, and BARTScore.

To run:

python macaw_eval.py -data_path PATH_TO_SYNTHETIC_DATA -out_path PATH_FOR_OUTPUT_RESULTS 
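
For reference, EM and F1 here are the standard SQuAD-style answer-overlap metrics. A minimal sketch of the two (macaw_eval.py's exact normalization may differ, and BARTScore is not shown):

# SQuAD-style exact match and token-level F1 between Macaw's predicted
# answer and the synthetically generated answer.
import re
import string
from collections import Counter

def normalize(text):
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the red house", "red house"), f1("a red house", "the old red house"))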

2. Coverage

To measure the coverage of the synthetic dataset's generated QA pairs using variance coverage, run:

python coverage_eval.py -data_path PATH_TO_SYNTHETIC_DATA -ctype var -synthetic True

To measure the coverage of the synthetic dataset's generated QA pairs using entropy coverage, run:

python coverage_eval.py -data_path PATH_TO_SYNTHETIC_DATA -ctype ent -synthetic True
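
The variance and entropy coverage metrics are specific to this project and are defined in coverage_eval.py and get_coverage_scores.py. Purely as a generic illustration of the two flavours, and not the project's actual definitions, one could compute the variance and the entropy of how often each source sentence is touched by a generated QA pair:

# Generic illustration only: not the project's definition of coverage.
import math

def variance_coverage(counts):
    # counts: number of QA pairs aligned to each source sentence (toy quantity)
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

def entropy_coverage(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

counts = [3, 0, 1, 2]  # toy example
print(variance_coverage(counts), entropy_coverage(counts))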

Human Annotation

The HTML file for the UI used in human annotation to compare two synthetic datasets can be found at human_eval/human_eval_UI.html. A script for formatting a dataset from two synthetic datasets for human evaluation can be found at human_eval/human_eval_examples.ipynb.