This repository contains the code and data for the following paper:
Finding a Balanced Degree of Automation for Summary Evaluation
```
@inproceedings{zhang2021finding,
  title={Finding a Balanced Degree of Automation for Summary Evaluation},
  author={Zhang, Shiyue and Bansal, Mohit},
  booktitle={The 2021 Conference on Empirical Methods in Natural Language Processing},
  year={2021}
}
```
Note that this repo is still work-in-progress.
- Python 3
- Packages listed in `requirements.txt` (install with `pip install -r requirements.txt`)
Run the following command to get the Lite2Pyramid score for abstractive BART summaries on 100 CNN/DM examples from REALSumm:
```
python Lite2-3Pyramid.py --unit data/REALSumm/SCUs.txt --summary data/REALSumm/abs_bart_out.summary --label data/REALSumm/abs_bart_out.label
```
Expected output:
```
{'p2c': 0.5014365067350771, 'l2c': 0.5159722360972361, 'p3c': 0.43105412152833117, 'l3c': 0.4964144744144744, 'human': 0.48349483849483854, 'model_type': 'shiyue/roberta-large-tac08'}
```
To get its Lite3Pyramid score:
```
python Lite2-3Pyramid.py --unit data/REALSumm/STUs.txt --summary data/REALSumm/abs_bart_out.summary
```
Expected output:
```
{'p2c': 0.4535250155726269, 'l2c': 0.48114382512911924, 'p3c': 0.38368004398714206, 'l3c': 0.45765291326320745, 'human': None, 'model': 'shiyue/roberta-large-tac08'}
```
Usually, "p2c" should be taken as the final score. When using the zero-shot NLI model (i.e., ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli), "l3c" should be used instead.
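To illustrate how probability-based ("p") and label-based ("l") scores can differ, here is a minimal sketch of the two aggregation styles. The function name `presence_scores` and the 0.5 threshold are assumptions for illustration only, not the repo's actual implementation: the "p" score averages each unit's NLI entailment probability, while the "l" score binarizes each probability into present/absent and averages the labels.

```python
def presence_scores(entail_probs, threshold=0.5):
    # entail_probs: one NLI entailment probability per content unit
    # (SCU or STU) for a single summary.
    if not entail_probs:
        raise ValueError("need at least one unit probability")
    # "p"-style score: average the raw entailment probabilities.
    p = sum(entail_probs) / len(entail_probs)
    # "l"-style score: binarize each probability into a 0/1 presence
    # label, then average (the threshold is an assumed value here).
    l = sum(prob >= threshold for prob in entail_probs) / len(entail_probs)
    return p, l

# Example: three units, only one confidently entailed.
p, l = presence_scores([0.9, 0.3, 0.1])
```

With borderline units, the two scores diverge: here `p` is about 0.43 while `l` is 1/3, since only one unit clears the threshold.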
To extract STUs:
```
python Lite2-3Pyramid.py --extract_stus --reference data/REALSumm/references.txt --doc_id data/REALSumm/ids.txt --output_dir data/REALSumm --use_coref
```
The extracted STUs for the REALSumm references will be saved in "data/REALSumm/STUs.txt". In addition, two intermediate files (ref_srls.pkl and ref_corefs.pkl) will be saved under "data/REALSumm".
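STUs are derived from semantic role labeling (and, with `--use_coref`, coreference) output over the reference. As a rough illustration of the idea only, the sketch below turns a hypothetical SRL frame into one short unit; the frame format and the `frame_to_stu` helper are assumptions, not the repo's actual code.

```python
def frame_to_stu(frame):
    # frame: a hypothetical SRL frame, i.e. a dict mapping role labels
    # (e.g. "ARG0", "V", "ARG1") to text spans from a reference sentence.
    # Concatenate the roles in a fixed order to form one short unit (STU).
    order = ["ARG0", "V", "ARG1", "ARG2", "ARGM-TMP", "ARGM-LOC"]
    parts = [frame[role] for role in order if role in frame]
    return " ".join(parts)

# Example frame for "The company announced record profits on Monday."
frame = {"ARG0": "The company", "V": "announced",
         "ARG1": "record profits", "ARGM-TMP": "on Monday"}
stu = frame_to_stu(frame)  # "The company announced record profits on Monday"
```

Each verb frame in a reference sentence would yield one such unit, so a single sentence can produce several STUs.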
We provide the zero-shot NLI model and 4 other NLI models, each finetuned on one of the 4 meta-evaluation sets. We suggest using the NLI model finetuned on dataset X when evaluating on X. When evaluating on a new dataset, we suggest using the TAC2008-finetuned model (i.e., shiyue/roberta-large-tac08) by default.
| Pretrained or finetuned on | Huggingface hub name |
|---|---|
| SNLI+MNLI+FEVER+ANLI | ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli |
| TAC2008 | shiyue/roberta-large-tac08 |
| TAC2009 | shiyue/roberta-large-tac09 |
| REALSumm | shiyue/roberta-large-realsumm |
| PyrXSum | shiyue/roberta-large-pyrxsum |
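The model-selection advice above can be sketched as a small helper; the `pick_nli_model` function is a hypothetical convenience, not part of the repo, and the hub names are taken from the table above.

```python
# Map each meta-evaluation set to its finetuned NLI model on the
# Huggingface hub (names from the table above).
NLI_MODELS = {
    "TAC2008": "shiyue/roberta-large-tac08",
    "TAC2009": "shiyue/roberta-large-tac09",
    "REALSumm": "shiyue/roberta-large-realsumm",
    "PyrXSum": "shiyue/roberta-large-pyrxsum",
}
ZERO_SHOT = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"

def pick_nli_model(dataset=None, zero_shot=False):
    # The zero-shot model ignores the dataset; remember to read "l3c"
    # instead of "p2c" when using it.
    if zero_shot:
        return ZERO_SHOT
    # Prefer the model finetuned on the same dataset; fall back to the
    # TAC2008-finetuned model for new datasets, as suggested above.
    return NLI_MODELS.get(dataset, "shiyue/roberta-large-tac08")

model = pick_nli_model("REALSumm")  # "shiyue/roberta-large-realsumm"
```

The returned string can then be passed wherever the script expects a model name.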
See `reproduce` for:
- reproduction code for the cross-validation experiments on TAC2008/TAC2009/PyrXSum
- reproduction code for the out-of-the-box experiments
- provide version control via a PyPI package
- provide support through the SacreROUGE toolkit