
REALSumm: Re-evaluating EvALuation in Summarization

Outline

Leaderboard

ExplainaBoard

Motivation

Evaluating summarization is hard. Most papers still use ROUGE, but a host of newer metrics (e.g., BERTScore, MoverScore) report better correlation with human evaluation. However, these metrics were tested on older systems (the classic TAC meta-evaluation datasets are now 6-12 years old). How do they fare with SOTA models, and do the conclusions drawn there still hold for modern systems and summarization tasks?

Released Data

Including all the system variants, there are a total of 25 system outputs: 11 extractive and 14 abstractive.

Please read our reproducibility instructions in addition to our paper in order to reproduce this work for another dataset.

Type | Sys ID | System Output | Human Judgement | Paper | Variant | Bib
Extractive | 1 | Download | Download | Heterogeneous Graph Neural Networks for Extractive Document Summarization | | Bib
Extractive | 2 | Download | Download | Extractive Summarization as Text Matching | | Bib
Extractive | 3 | Download | Download | Searching for Effective Neural Extractive Summarization: What Works and What’s Next | LSTM+PN+RL | Bib
Extractive | 4 | Download | Download | Searching for Effective Neural Extractive Summarization: What Works and What’s Next | BERT+TF+SL |
Extractive | 5 | Download | Download | Searching for Effective Neural Extractive Summarization: What Works and What’s Next | BERT+TF+PN |
Extractive | 6 | Download | Download | Searching for Effective Neural Extractive Summarization: What Works and What’s Next | BERT+LSTM+PN |
Extractive | 7 | Download | Download | Searching for Effective Neural Extractive Summarization: What Works and What’s Next | BERT+LSTM+PN+RL |
Extractive | 8 | Download | Download | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | | Bib
Extractive | 9 | Download | Download | Ranking Sentences for Extractive Summarization with Reinforcement Learning | | Bib
Extractive | 10 | Download | Download | Neural Document Summarization by Jointly Learning to Score and Select Sentences | | Bib
Extractive | 11 | Download | Download | BanditSum: Extractive Summarization as a Contextual Bandit | | Bib
Abstractive | 12 | Download | Download | Learning by Semantic Similarity Makes Abstractive Summarization Better | | Bib
Abstractive | 13 | Download | Download | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | | Bib
Abstractive | 14 | Download | Download | Text Summarization with Pretrained Encoders | TransAbs | Bib
Abstractive | 15 | Download | Download | Text Summarization with Pretrained Encoders | Abs |
Abstractive | 16 | Download | Download | Text Summarization with Pretrained Encoders | ExtAbs |
Abstractive | 17 | Download | Download | Pretraining-Based Natural Language Generation for Text Summarization | | Bib
Abstractive | 18 | Download | Download | Unified Language Model Pre-training for Natural Language Understanding and Generation | v1 | Bib
Abstractive | 19 | Download | Download | Unified Language Model Pre-training for Natural Language Understanding and Generation | v2 |
Abstractive | 20 | Download | Download | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Base | Bib
Abstractive | 21 | Download | Download | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | Large |
Abstractive | 22 | Download | Download | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | 11B |
Abstractive | 23 | Download | Download | Bottom-Up Abstractive Summarization | | Bib
Abstractive | 24 | Download | Download | Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting | | Bib
Abstractive | 25 | Download | Download | Get To The Point: Summarization with Pointer-Generator Networks | | Bib

Meta-evaluation Tool

  1. Calculate the metric scores for each summary and create a scores dict in the format shown below (see the next section for calculating scores with a new metric). Make sure to include litepyramid_recall in the scores dict, as this is the metric that records the human judgements; a quick sanity check is sketched after this list.
  2. Run the analysis notebook on the scores dict to get all the graphs and tables used in the paper.
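As an optional sanity check (not part of the released scripts), the pickled scores dict can be inspected to confirm that litepyramid_recall is present for every system summary before running the notebook; the path below is illustrative.

import pickle

# Load a generated scores dict (illustrative path).
with open('score_dicts/abs_new_metric.pkl', 'rb') as f:
    scores_dict = pickle.load(f)

# Every system summary should carry the human metric litepyramid_recall.
for doc_id, doc in scores_dict.items():
    for system_name, entry in doc['system_summaries'].items():
        assert 'litepyramid_recall' in entry['scores'], (doc_id, system_name)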

Calculating scores with a new metric

  1. Update scorer.py so that (1) any setup required by your metric is done in the __init__ function of the scorer, since the same scorer is used to score all systems, and (2) your metric is added to the score function as
elif self.metric == "name_of_my_new_metric":
  scores = call_to_my_function_which_gives_scores(passing_appropriate_arguments)

where scores is a list with one entry per summary in a file, each entry being a dictionary of score components, e.g. [{'precision': 0.0, 'recall': 1.0} ...]. A sketch of such a metric function is given after this list.

  2. Calculate the scores and build the scores dict using python get_scores.py --data_path ../selected_docs_for_human_eval/<abs or ext> --output_path ../score_dicts/abs_new_metric.pkl --log_path ../logs/scores.log -n_jobs 1 --metric <name of metric>
  3. Your scores dict is generated at the output path.
  4. Merge it with the scores dict containing the human scores, provided in scores_dicts/, using python score_dict_update.py --in_path <score dicts folder with the dicts to merge> --out_path <output path to place the merged dict pickle> -action merge
  5. Your dict will be merged with the one containing the human scores and the output will be placed in out_path. You can now run the analysis notebook on the merged scores dict to get all the graphs and tables used in the paper.
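As referenced in step 1, the sketch below illustrates what a metric function returning this format might look like. It is a hypothetical example (a trivial unigram-overlap metric), not code from the repository, and the argument names are assumptions:

from collections import Counter

def call_to_my_function_which_gives_scores(ref_summs, system_summs):
    # Hypothetical metric: unigram precision/recall of each system summary
    # against its reference summary. Returns one score dict per summary,
    # matching the list-of-dictionaries format expected by scorer.py.
    scores = []
    for ref, hyp in zip(ref_summs, system_summs):
        ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
        overlap = sum((ref_counts & hyp_counts).values())
        scores.append({
            'precision': overlap / max(sum(hyp_counts.values()), 1),
            'recall': overlap / max(sum(ref_counts.values()), 1),
        })
    return scores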

Scores dict format used

{
    doc_id: {
            'doc_id': value of doc id,
            'ref_summ': reference summary of this doc,
            'system_summaries': {
                system_name: {
                        'system_summary': the generated summary,
                        'scores': {
                            'js-2': the actual score,
                            'rouge_l_f_score': the actual score,
                            'rouge_1_f_score': the actual score,
                            'rouge_2_f_score': the actual score,
                            'bert_f_score': the actual score
                        }
                }
            }
        }
}
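The analysis notebook consumes this structure. As an illustration of how it can be traversed (a sketch only; the metric name, averaging scheme, and file path are assumptions, not the notebook's exact procedure), the snippet below computes a system-level Pearson correlation between an automatic metric and litepyramid_recall:

import pickle
from collections import defaultdict
from statistics import mean
from scipy.stats import pearsonr

# Load a merged scores dict (illustrative path).
with open('score_dicts/abs.pkl', 'rb') as f:
    scores_dict = pickle.load(f)

# Average each score over documents to get one value per system.
metric, human = 'rouge_2_f_score', 'litepyramid_recall'
per_system = defaultdict(lambda: {'metric': [], 'human': []})
for doc in scores_dict.values():
    for system_name, entry in doc['system_summaries'].items():
        per_system[system_name]['metric'].append(entry['scores'][metric])
        per_system[system_name]['human'].append(entry['scores'][human])

systems = sorted(per_system)
metric_avg = [mean(per_system[s]['metric']) for s in systems]
human_avg = [mean(per_system[s]['human']) for s in systems]
print('System-level Pearson r:', pearsonr(metric_avg, human_avg)[0])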

Bib

@inproceedings{Bhandari-2020-reevaluating,
title = "Re-evaluating Evaluation in Text Summarization",
author = "Bhandari, Manik  and Narayan Gour, Pranav  and Ashfaq, Atabak  and  Liu, Pengfei and Neubig, Graham ",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2020"
}