
MaUde

Metric for automatic Unreferenced dialog evaluation.

Contains the code for the paper "Learning an Unreferenced Metric for Online Dialogue Evaluation", ACL 2020 (arXiv:2005.00583).

Installation

  • pip install -r requirements.txt
  • Install ParlAI (one possible install route is sketched below)
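
A minimal setup sketch, assuming a fresh Python environment; the ParlAI source install below is an assumption, not a pinned requirement (check the ParlAI docs for the current method):

# Install the repository's Python dependencies
pip install -r requirements.txt
# Install ParlAI from source (assumption; a plain pip install may also work)
git clone https://github.com/facebookresearch/ParlAI.git
cd ParlAI
python setup.py develop
cd ..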

Getting the data

  • Get the ConvAI2 train and test data and pre-trained DistilBERT embeddings here. Download and unzip them into the folder convai2_data.
  • Get the trained model checkpoints from here. Download and unzip them into the folder full_acl_runs (a layout sketch follows below).
  • Due to individual licensing restrictions, we cannot release the train/test data of MultiWOZ, Frames, and DailyDialog. Please send me an email if you need them!
  • Run inference using ./run_inference.sh

N.B. - For model names and checkpoints, please refer to the run_inference.sh script.
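
A minimal sketch of the expected layout, assuming the downloads arrive as zip archives (the archive names below are hypothetical; use whatever the download links provide):

mkdir -p convai2_data full_acl_runs
# Hypothetical archive names
unzip convai2_data.zip -d convai2_data
unzip full_acl_runs.zip -d full_acl_runs
# Run inference on the released checkpoints
./run_inference.sh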

Computing Backtranslation

We use FairSeq to compute back-translations. Our modified scripts are in the scripts folder; to run them, cd into that folder and execute ./run_trans.sh.
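
A minimal sketch of the workflow, assuming FairSeq can be installed from PyPI (the exact version used for the paper is not pinned here):

# Install FairSeq (assumption: the PyPI package is sufficient for run_trans.sh)
pip install fairseq
cd scripts
./run_trans.sh
cd ..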

Computing Corruption Files

In the data dump we already provide the corruption files used for training. To generate new corruption files for a dataset, use scripts/compute_corrupt.py.
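
The exact command-line flags of compute_corrupt.py are not documented here; a safe starting point (assuming an argparse-style CLI) is to inspect its options. The commented invocation below is purely hypothetical:

python scripts/compute_corrupt.py --help
# Hypothetical invocation; the actual flag names may differ
# python scripts/compute_corrupt.py --data_loc convai2_data/ --corrupt_type all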

Training Script

Uses PyTorch Lightning as the boilerplate for reproducibility.

python codes/trainer.py --mode train \
    --corrupt_type all \
    --batch_size 64 \
    --model_save_dir /checkpoint/koustuvs/dialog_metric/all_dailydialog_10_runs \
    --learn_down True --downsample True --down_dim 300 \
    --optim adam,lr=0.001 --dropout 0.2 --decoder_hidden 200 \
    --data_name convai2 \
    --data_loc /checkpoint/koustuvs/dialog_metric/convai2_data/ \
    --use_cluster

For baselines, add the appropriate flag:

--train_baseline [infersent/ruber/bertnli]
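
For instance, a sketch of an InferSent baseline run that reuses the flags from the command above (the data location and save path here are placeholders):

python codes/trainer.py --mode train \
    --train_baseline infersent \
    --corrupt_type all \
    --batch_size 64 \
    --data_name convai2 \
    --data_loc convai2_data/ \
    --model_save_dir checkpoints/infersent_convai2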

An example training script is provided at run_training.sh

Inference Script

CUDA_VISIBLE_DEVICES=0 python codes/inference.py \
    --id $MODEL_ID --model_save_dir $MODEL_SAVE_DIR \
    --model_version $VERSION --train_mode nce \
    --corrupt_pre $DATA_LOC --test_suffix true_response \
    --test_column true_response --results_file "results.jsonl"
  • Outputs the results in a jsonl file. To measure human correlation with See et al. (2019), specify the --human_eval flag and the --human_eval_file location (see the example after this list).
  • We have also added a script to run inference on our trained checkpoints: run_inference.sh.
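
A sketch of such a run, reusing the flags from the command above; the human-evaluation file location is a placeholder:

CUDA_VISIBLE_DEVICES=0 python codes/inference.py \
    --id $MODEL_ID --model_save_dir $MODEL_SAVE_DIR \
    --model_version $VERSION --train_mode nce \
    --corrupt_pre $DATA_LOC --test_suffix true_response \
    --test_column true_response --results_file "results.jsonl" \
    --human_eval --human_eval_file $HUMAN_EVAL_FILE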

Acknowledgements

Questions

Citation

If our work is useful for your research, consider citing it using the following BibTeX:

@article{sinha2020maude,
  author = {Koustuv Sinha and Prasanna Parthasarathi and Jasmine Wang and Ryan Lowe and William L. Hamilton and Joelle Pineau},
  title = {Learning an Unreferenced Metric for Online Dialogue Evaluation},
  year = {2020},
  journal = {ACL},
  arxiv = {2005.00583},
  url = {https://arxiv.org/abs/2005.00583}
}

License

This work is licensed under CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International), as found in the LICENSE file.