CDEvalSumm: Cross-Dataset Evaluation for Summarization

Descriptions and metrics code for the EMNLP 2020 Findings paper:

CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems
(Yiran Chen*, Pengfei Liu*, Ming Zhong, Zi-Yi Dou, Danqing Wang, Xipeng Qiu, Xuanjing Huang)

Motivation

Many works evaluate summarization systems in an in-domain setting, where the model is trained and tested on the same dataset. In this work, we try to understand model performance from different perspectives in a cross-dataset setting. The picture below illustrates the main motivation: summarization systems get different rankings when evaluated under different measures (abstractive models are shown in red, extractive ones in blue):

Two Research Questions

Q1: How do different neural architectures of summarizers influence cross-dataset generalization performance?
Q2: Do different generation approaches (extractive and abstractive) of summarizers influence cross-dataset generalization ability?

Evaluation Systems

The evaluated systems cover extractive (Ext-Sum) and abstractive (Abs-Sum) summarizers:

| Type | System | Paper | Bib |
|---|---|---|---|
| Ext-Sum | LSTM_{non} | Content Selection in Deep Learning Models of Summarization | Bib |
| Ext-Sum | Trans_{non} | Text Summarization with Pretrained Encoders | Bib |
| Ext-Sum | Trans_{auto} | Searching for Effective Neural Extractive Summarization: What Works and What’s Next | Bib |
| Ext-Sum | BERT_{non} | Text Summarization with Pretrained Encoders | Bib |
| Ext-Sum | BERT_{match} | Extractive Summarization as Text Matching | Bib |
| Abs-Sum | L2L^{cov}_{ptr} | Get To The Point: Summarization with Pointer-Generator Networks | Bib |
| Abs-Sum | L2L_{ptr} | Get To The Point: Summarization with Pointer-Generator Networks | Bib |
| Abs-Sum | L2L | CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems | Bib |
| Abs-Sum | T2T | Text Summarization with Pretrained Encoders | Bib |
| Abs-Sum | BE2T | Text Summarization with Pretrained Encoders | Bib |
| Abs-Sum | BART | BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | Bib |

Datasets

Evaluation Metrics

Cross-dataset Measures

  • Stiffness
  • Stableness

Both measures are defined over the metric score obtained when the model is trained on dataset i and tested on dataset j.
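As a rough illustration, the two measures can be sketched over a square score matrix `U`, where `U[i][j]` is the metric score when the model is trained on dataset i and tested on dataset j. The formulas used here are assumptions, since the equations are not reproduced in this text: stiffness is taken as the mean of all `U[i][j]`, and stableness as the mean of `U[i][j] / U[j][j]` (each score normalized by the in-dataset score on the test dataset), expressed as a percentage. The function names and the toy matrix are hypothetical and are not results from the paper; please check the paper for the exact definitions.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def stiffness(U: Matrix) -> float:
    """Mean of all cross-dataset scores (assumed definition)."""
    n = len(U)
    return sum(sum(row) for row in U) / (n * n)


def stableness(U: Matrix) -> float:
    """Mean of U[i][j] / U[j][j], as a percentage (assumed definition).

    U[j][j] is the in-dataset score for test dataset j, so each entry is
    measured relative to the model trained and tested on the same dataset.
    """
    n = len(U)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += U[i][j] / U[j][j]
    return 100.0 * total / (n * n)


# Toy 2x2 matrix with made-up scores (rows: training set, columns: test set).
U = [
    [40.0, 20.0],
    [10.0, 30.0],
]

print(f"stiffness:  {stiffness(U):.2f}")
print(f"stableness: {stableness(U):.2f}%")
```

Under these assumed definitions, a higher stiffness means better average performance across all train/test pairs, while a stableness close to 100% means cross-dataset scores stay close to the in-dataset ones.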

Experiment Results

The stiffness and stableness of various summarizers are displayed below. For fine-grained results and comprehensive analysis please refer to the paper.