FaithEval-FFLM

A zero-shot faithfulness evaluation metric for text summarization


Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model

This paper was accepted at EMNLP 2023.

Requirements

  • python==3.7
  • pytorch==1.11.0
  • transformers==4.28.1
  • scipy==1.7.3
  • scikit-learn==1.0.2
  • numpy==1.21.5
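
For reference, the dependencies can be installed with pip into an existing Python 3.7 environment (a minimal sketch; note that PyTorch is published on PyPI as torch, and a CUDA-matched build may be preferable):

pip install torch==1.11.0 transformers==4.28.1 scipy==1.7.3 scikit-learn==1.0.2 numpy==1.21.5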

Prepare datasets

Download the benchmark datasets and put them under the directory ./data; modify the corresponding paths in load_dataset.py if necessary. (A quick sanity-check sketch follows the table below.)

Setting                  Dataset    Val   Test  Source  Link
Inconsistency Detection  CoGenSum   1281  400   C       https://github.com/tingofurro/summac
(SummaC Benchmark)       SummEval   850   850   C
                         FRANK      671   1575  C+X
                         Polytope   634   634   C
                         FactCC     931   503   C
                         XSumFaith  1250  1250  X
Faithfulness Rating      FRANKCNN   -     1250  C       https://github.com/NJUNLP/CoP
                         QAGSCNN    -     235   C
                         SummEval   -     1600  C       https://github.com/Yale-LILY/SummEval
                         FRANKXSUM  -     996   X       https://github.com/NJUNLP/CoP
                         QAGSXSUM   -     239   X

Source: C = source documents from CNN/DailyMail; X = source documents from XSum.
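
As a quick sanity check before running the pipeline, a short script can verify that the data directory is populated. The subdirectory names below are illustrative assumptions, not the repository's actual layout; match them to whatever load_dataset.py expects:

import os

# Illustrative dataset directory names; adjust to match load_dataset.py.
EXPECTED = ["cogensum", "summeval", "frank", "polytope", "factcc", "xsumfaith"]
DATA_ROOT = "./data"

for name in EXPECTED:
    path = os.path.join(DATA_ROOT, name)
    status = "found" if os.path.exists(path) else "MISSING"
    print(f"{status:7} {path}")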

Probability Calculation

Calculate the probabilities under a foundation language model by running:

CUDA_VISIBLE_DEVICES=0 python3 main.py

The results will be saved under the directory ./output; alternatively, precomputed results can be downloaded with this link.
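
For intuition, the quantity behind FFLM is how the probability of the summary under a causal foundation LM changes with different conditioning prefixes, e.g. with and without the source document. The sketch below shows only this basic conditional scoring, not the full metric in main.py; the model choice and the document/summary joining format are assumptions:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# The model here is an assumption for illustration only;
# main.py defines the foundation LM actually used.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def summary_token_logprobs(prefix, summary):
    """Log-probability of each summary token given the prefix."""
    # Prepend BOS so even an empty prefix provides one context token.
    prefix_ids = tokenizer(tokenizer.bos_token + prefix, return_tensors="pt").input_ids
    summary_ids = tokenizer(summary, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, summary_ids], dim=-1)
    log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    n = summary_ids.size(-1)
    # Logits at position t predict token t+1, hence the shift by one.
    preds = log_probs[0, -n - 1:-1, :]
    return preds.gather(-1, summary_ids[0].unsqueeze(-1)).squeeze(-1)

document = "The cat sat on the mat in the sunny kitchen."
summary = "A cat sat on a mat."
with_doc = summary_token_logprobs(document + "\n", summary)  # separator is an assumption
without_doc = summary_token_logprobs("", summary)
# Prepending a faithful source tends to raise the summary's probability.
print("mean log-prob change:", (with_doc - without_doc).mean().item())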

FFLM

Then, the summary-level and system-level performance of FFLM can be computed as follows:

python3 summary-level-evaluation.py --file_path xxx
python3 system-level-evaluation.py --file_path xxx
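
For intuition, summary-level evaluation correlates the metric's per-summary scores with human ratings, while system-level evaluation correlates per-system averages. A toy sketch with scipy follows; all numbers are placeholders, not real results:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder scores; the real ones are produced under ./output.
metric_scores = np.array([0.91, 0.45, 0.78, 0.30, 0.66, 0.52])
human_ratings = np.array([5.0, 2.0, 4.0, 1.0, 3.0, 2.0])

# Summary-level: correlate scores across individual summaries.
print("summary-level Pearson: ", pearsonr(metric_scores, human_ratings)[0])
print("summary-level Spearman:", spearmanr(metric_scores, human_ratings)[0])

# System-level: average over each system's summaries first (three
# hypothetical systems of two summaries each), then correlate the means.
sys_metric = [metric_scores[i:i + 2].mean() for i in range(0, 6, 2)]
sys_human = [human_ratings[i:i + 2].mean() for i in range(0, 6, 2)]
print("system-level Pearson:  ", pearsonr(sys_metric, sys_human)[0])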

Citation

@inproceedings{jia2023fflm,
  title={Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model},
  author={Jia, Qi and Ren, Siyu and Liu, Yizhu and Zhu, Kenny Q.},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023}
}