with a focus on summarization and factual consistency.
Organized by Robert Tang.
- Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text Sebastian Gehrmann, Elizabeth Clark, Thibault Sellam [pdf]
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors Liyan Tang, Tanya Goyal, Alexander R. Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F. Rousseau, Greg Durrett [pdf] [code]
- Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods Wei Li, Wenhao Wu, Moye Chen, Jiachen Liu, Xinyan Xiao, Hua Wu [pdf]
- A Semantic QA-Based Approach for Text Summarization Evaluation Ping Chen, Fei Wu, Tong Wang, Wei Ding AAAI 2018 [pdf]
- Answers Unite! Unsupervised Metrics for Reinforced Summarization Models Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano EMNLP 2019 [pdf]
- Question Answering as an Automatic Evaluation Metric for News Article Summarization Matan Eyal, Tal Baumel, Michael Elhadad NAACL 2019
- FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization Esin Durmus, He He, Mona Diab ACL 2020
- Asking and Answering Questions to Evaluate the Factual Consistency of Summaries Alex Wang, Kyunghyun Cho, Mike Lewis ACL 2020
- Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, Omri Abend EMNLP 2021
- QuestEval: Summarization Asks for Fact-based Evaluation Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, Patrick Gallinari EMNLP 2021
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong NAACL 2022 [pdf] [code] (a minimal QA-based scoring sketch appears after this list)
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics Daniel Deutsch, Dan Roth [pdf]
- DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence Wei Zhao, Michael Strube, Steffen Eger [pdf] [code]
- WIDAR -- Weighted Input Document Augmented ROUGE Raghav Jain, Vaibhav Mavi, Anubhav Jangra, Sriparna Saha ECIR 2022 [pdf] [code]
- InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation Pierre Colombo, Chloé Clavel, Pablo Piantanida AAAI 2022 [pdf]
- BLEURT: Learning Robust Metrics for Text Generation Thibault Sellam, Dipanjan Das, Ankur Parikh ACL 2020
- BERTScore: Evaluating Text Generation with BERT Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi ICLR 2020 (a usage example appears after this list)
- MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, Steffen Eger EMNLP 2019
- Fill in the BLANC: Human-free quality estimation of document summaries Oleg Vasilyev, Vedant Dharnidharka, John Bohannon
- BERTFaith: Focus Attention: Promoting Faithfulness and Diversity in Summarization Rahul Aralikatte, Shashi Narayan, Joshua Maynez, Sascha Rothe, Ryan McDonald ACL 2021 [pdf]
- BARTScore: Evaluating Generated Text as Text Generation Weizhe Yuan, Graham Neubig, Pengfei Liu NeurIPS 2021 (a likelihood-scoring sketch appears after this list)
- (SCConv) SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization Philippe Laban, Tobias Schnabel, Paul N. Bennett, Marti A. Hearst TACL 2022 (an NLI-based scoring sketch appears after this list)
- Evaluating the Factual Consistency of Abstractive Text Summarization Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher EMNLP 2020 [pdf] [code]
- DAE: Annotating and Modeling Fine-grained Factuality in Summarization Tanya Goyal, Greg Durrett NAACL 2021
- Evaluating Factuality in Generation with Dependency-level Entailment Tanya Goyal, Greg Durrett Findings of EMNLP 2020
- FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations Leonardo F. R. Ribeiro, Mengwen Liu, Iryna Gurevych, Markus Dreyer, Mohit Bansal NAACL 2022 [pdf] [code]
- Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts Elizabeth Clark, Asli Celikyilmaz, Noah A. Smith ACL 2019
- Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation Yuexiang Xie, Fei Sun, Yang Deng, Yaliang Li, Bolin Ding Findings of EMNLP 2021 [pdf] [code]
- Topic-Aware Evaluation and Transformer Methods for Topic-Controllable Summarization Tatiana Passali, Grigorios Tsoumakas [pdf]
- Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization Prasetya Ajie Utama, Joshua Bambrick, Nafise Sadat Moosavi, Iryna Gurevych NAACL 2022 [pdf] [code] (a toy perturbation sketch appears after this list)
  [Abs] Neural abstractive summarization models are prone to generate summaries that are factually inconsistent with their source documents. Previous work has introduced the task of recognizing such factual inconsistency as a downstream application of natural language inference (NLI). However, state-of-the-art NLI models perform poorly in this context due to their inability to generalize to the target task. In this work, we show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples. We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries, introducing varying types of factual inconsistencies. Unlike previously introduced document-level NLI datasets, our generated dataset contains examples that are diverse and inconsistent yet plausible. We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization.
- Gradient-Based Adversarial Factual Consistency Evaluation for Abstractive Summarization Zhiyuan Zeng, Jiaze Chen, Weiran Xu, Lei Li EMNLP 2021
- Evaluating the Efficacy of Summarization Evaluation across Languages Fajri Koto, Jey Han Lau, Timothy Baldwin Findings of ACL 2021 [pdf]
- SummEval: Re-evaluating Summarization Evaluation Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev [pdf] [code]
- FFCI: A Framework for Interpretable Automatic Evaluation of Summarization Fajri Koto, Jey Han Lau, Timothy Baldwin [pdf] [code]
- TRUE: Re-evaluating Factual Consistency Evaluation Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, Yossi Matias NAACL 2022 [pdf]
- DialSummEval: Revisiting Summarization Evaluation for Dialogues Mingqi Gao, Xiaojun Wan NAACL 2022
- (CGS) Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, Iryna Gurevych ACL 2019 [pdf] [data]
- (XSF) On Faithfulness and Factuality in Abstractive Summarization Joshua Maynez, Shashi Narayan, Bernd Bohnet, Ryan McDonald ACL 2020
- What Have We Achieved on Text Summarization? Dandan Huang, Leyang Cui, Sen Yang, Guangsheng Bao, Kun Wang, Jun Xie, Yue Zhang EMNLP 2020 [pdf]
- Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics Artidoro Pagnoni, Vidhisha Balachandran, Yulia Tsvetkov NAACL 2021 [pdf] [code]
- XSF: https://github.com/google-research-datasets/xsum_hallucination_annotations#license
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, Shouling Ji EMNLP 2020 [pdf] [code]
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy Wang Chen, Piji Li, Irwin King ACL 2021 [pdf] [code]
- Reference-free Summarization Evaluation via Semantic Correlation and Compression Ratio Yizhu Liu, Qi Jia, Kenny Zhu NAACL 2022 [pdf] [code]
- MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification Yu Lu Liu, Rachel Bawden, Thomas Scialom, Benoît Sagot, Jackie Chi Kit Cheung
- HIGHRES: Highlight-based Reference-less Evaluation of Summarization Hardy, Shashi Narayan, Andreas Vlachos ACL 2019 [pdf] [code]
- Is Human Scoring the Best Criteria for Summary Evaluation? Oleg Vasilyev, John Bohannon Findings of ACL 2021 [pdf]
- How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation Julius Steen, Katja Markert EACL 2021 [pdf] [code]
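
The sketches below are illustrative only, not the authors' released implementations. First, the QA-based metrics above (FEQA, QuestEval, QAFactEval, and the Wang et al. QAGS paper) share one recipe: pose questions, answer them against both the summary and the source, and compare the answers. In this minimal sketch the questions are hand-written and the QA model is an off-the-shelf SQuAD checkpoint, whereas the real metrics generate questions automatically:

```python
# Minimal QA-based consistency sketch: answer the same questions from the
# source and the summary, then compare answers with normalized token F1.
import string
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def normalize(s: str) -> list[str]:
    """Lowercase, strip punctuation, split into tokens (SQuAD-style)."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    return s.split()

def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two answer strings."""
    a_toks, b_toks = normalize(a), normalize(b)
    overlap = sum((Counter(a_toks) & Counter(b_toks)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(a_toks), overlap / len(b_toks)
    return 2 * p * r / (p + r)

def qa_consistency(source: str, summary: str, questions: list[str]) -> float:
    """Average answer agreement between source-grounded and summary-grounded QA."""
    scores = [token_f1(qa(question=q, context=source)["answer"],
                       qa(question=q, context=summary)["answer"])
              for q in questions]
    return sum(scores) / len(scores)

source = "The Eiffel Tower, built in 1889, is located in Paris, France."
summary = "The Eiffel Tower in Paris was built in 1889."
print(qa_consistency(source, summary,
                     ["Where is the Eiffel Tower?",
                      "When was the Eiffel Tower built?"]))
```

The answer-comparison step is itself a design choice; the Deutsch & Roth entry above benchmarks alternatives to plain token F1.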
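
Second, a zero-shot sketch in the spirit of SummaC-ZS: each summary sentence is scored by the best entailment probability it receives from any source sentence, and the scores are averaged. The released SummaC additionally trains a convolutional aggregator over the full NLI score matrix (SummaC-Conv), which this sketch omits; the MNLI checkpoint name is just a common off-the-shelf choice:

```python
# NLI-based inconsistency sketch: max entailment per summary sentence,
# averaged over the summary.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME).eval()
# Read the entailment class index from the checkpoint's own label map.
ENTAILMENT = {v.upper(): k for k, v in model.config.id2label.items()}["ENTAILMENT"]

def nli_consistency(source_sents: list[str], summary_sents: list[str]) -> float:
    """Score each summary sentence by the best entailment probability it
    gets from any source sentence, then average over the summary."""
    scores = []
    for hyp in summary_sents:
        best = 0.0
        for prem in source_sents:
            enc = tokenizer(prem, hyp, return_tensors="pt", truncation=True)
            with torch.no_grad():
                probs = model(**enc).logits.softmax(dim=-1)[0]
            best = max(best, probs[ENTAILMENT].item())
        scores.append(best)
    return sum(scores) / len(scores)
```

Sentence-level decomposition is the key move here: SummaC's finding is that document-level NLI scores are unreliable, while sentence-pair granularity recovers much of the signal.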
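
Third, BERTScore is available as a pip package (`pip install bert-score`); this usage example follows the official repository's README:

```python
# BERTScore usage: greedy token matching over contextual embeddings,
# reported as precision/recall/F1 tensors over the candidate batch.
from bert_score import score

cands = ["The Eiffel Tower in Paris was built in 1889."]
refs = ["The Eiffel Tower, built in 1889, stands in Paris, France."]

P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```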
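
Fourth, the core of BARTScore: read faithfulness off as the average log-likelihood of the summary tokens given the source under a seq2seq model. The released BARTScore uses BART weights further fine-tuned on CNN/DM (optionally ParaBank2); plain `facebook/bart-large-cnn` serves as a stand-in here:

```python
# BARTScore-style sketch: -loss is the average log-probability of the
# summary given the source (higher, i.e. less negative, is better).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

NAME = "facebook/bart-large-cnn"  # stand-in for the released BARTScore weights
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(NAME).eval()

def bartscore(source: str, summary: str) -> float:
    """Average log-likelihood of summary tokens conditioned on the source."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # .loss is the mean token-level cross-entropy, so its negation is
        # the average log-probability of the summary.
        out = model(input_ids=src.input_ids,
                    attention_mask=src.attention_mask,
                    labels=tgt.input_ids)
    return -out.loss.item()
```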
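
Finally, Falsesum's pipeline trains a controllable generator to corrupt reference summaries with specific error types. The toy function below only illustrates the shape of its output (plausible but unfaithful summaries paired with their source documents as "not entailed" NLI examples) via a naive same-type entity swap; the spaCy model name and the `corrupt_summary` helper are assumptions of this sketch, not part of the paper:

```python
# Toy Falsesum-style negative example generator: swap one named entity in
# the summary for a same-type entity that appears in the document.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English pipeline

def corrupt_summary(document: str, summary: str) -> str | None:
    """Return a plausible-but-unfaithful summary, or None if no same-type
    distractor entity can be found in the document."""
    doc_ents = {(e.text, e.label_) for e in nlp(document).ents}
    for ent in nlp(summary).ents:
        candidates = [text for text, label in doc_ents
                      if label == ent.label_ and text != ent.text]
        if candidates:
            return summary.replace(ent.text, random.choice(candidates))
    return None
```

A pair (document, corrupted summary) then serves as a document-level NLI example with label "not entailed", mirroring the augmentation that Falsesum performs with far more varied and fluent perturbations.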