Repository to replicate the ROUGE scores from See et al. (2017).
We find that the reported scores correspond to those produced by the Python re-implementation py-rouge, rather than those produced by pyrouge, the wrapper around the official ROUGE-1.5.5 Perl script.
The evaluate.py script accepts a 'hypothesis' folder and a 'reference' folder. ROUGE scores are then computed with both py-rouge and pyrouge and printed to standard output; a minimal sketch of this dual evaluation is shown below.
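The sketch below illustrates the general approach, not the actual contents of evaluate.py; the helper names and the `*_decoded.txt` / `*_reference.txt` filename patterns (used in See et al.'s repository) are assumptions.

```python
# Hypothetical sketch of evaluate.py: score the same folders with both backends.
import argparse
from pathlib import Path

import rouge                 # py-rouge (Python re-implementation)
from pyrouge import Rouge155  # wrapper around the Perl ROUGE-1.5.5 script


def read_folder(folder):
    """Read every summary file in a folder, sorted by filename."""
    return [p.read_text(encoding="utf-8").strip() for p in sorted(Path(folder).iterdir())]


def py_rouge_scores(hyp_dir, ref_dir):
    """Average ROUGE-1/2/L F1 with py-rouge."""
    evaluator = rouge.Rouge(metrics=["rouge-n", "rouge-l"], max_n=2,
                            limit_length=False, apply_avg=True, stemming=True)
    scores = evaluator.get_scores(read_folder(hyp_dir), read_folder(ref_dir))
    return {k: 100 * v["f"] for k, v in scores.items()}


def perl_rouge_scores(hyp_dir, ref_dir):
    """Average ROUGE-1/2/L F1 with ROUGE-1.5.5 via pyrouge.

    Assumes See et al.-style file naming, e.g. 000000_decoded.txt / 000000_reference.txt.
    """
    r = Rouge155()
    r.system_dir = hyp_dir
    r.model_dir = ref_dir
    r.system_filename_pattern = r"(\d+)_decoded.txt"
    r.model_filename_pattern = "#ID#_reference.txt"
    output = r.output_to_dict(r.convert_and_evaluate())
    return {m: 100 * output[f"{m}_f_score"] for m in ("rouge_1", "rouge_2", "rouge_l")}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("hypothesis_dir")
    parser.add_argument("reference_dir")
    args = parser.parse_args()
    print("Python (py-rouge) scores:", py_rouge_scores(args.hypothesis_dir, args.reference_dir))
    print("Perl (pyrouge) scores:", perl_rouge_scores(args.hypothesis_dir, args.reference_dir))
```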
The test_output folder contains the test outputs from See et al. (2017), which can be downloaded from the README.md of the official repository.
pip install py-rouge pyrouge
Ensure Perl XML library is installed:
On Arch Linux: sudo pacman -S perl-xml-xpath
On Ubuntu: sudo apt-get install libxml-parser-perl
ROUGE-1.5.5 install tips/debugging:
https://stackoverflow.com/questions/47045436/how-to-install-the-python-package-pyrouge-on-microsoft-windows
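After installing, a quick sanity check like the following (a hypothetical snippet, assuming pyrouge has already been pointed at a local ROUGE-1.5.5 checkout, e.g. via the `pyrouge_set_rouge_path` command) confirms that both backends are usable:

```python
# Sanity check for both ROUGE backends.
import rouge                  # py-rouge
from pyrouge import Rouge155  # Perl ROUGE-1.5.5 wrapper

# py-rouge: scoring identical strings should give (near-)perfect F1.
print(rouge.Rouge(metrics=["rouge-n"], max_n=1).get_scores(
    ["the cat sat on the mat"], ["the cat sat on the mat"]))

# pyrouge: constructing Rouge155 raises if it cannot locate the Perl ROUGE-1.5.5 install.
Rouge155()
print("pyrouge located ROUGE-1.5.5 successfully")
```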
Note that pyrouge is roughly 4x slower than py-rouge, so some patience is required.
# Evaluate Pointer Generator
$> python evaluate.py test_output/pointer-gen test_output/reference
Python (py-rouge) scores:
ROUGE-1 (F1): 36.43
ROUGE-2 (F1): 15.66
ROUGE-L (F1): 33.42
Perl (pyrouge) scores:
ROUGE-1 (F1): 36.16
ROUGE-2 (F1): 15.61
ROUGE-L (F1): 33.21
# Evaluate Pointer Generator + Coverage
$> python evaluate.py test_output/pointer-gen-cov test_output/reference
Python (py-rouge) scores:
ROUGE-1 (F1): 39.53
ROUGE-2 (F1): 17.28
ROUGE-L (F1): 36.38
Perl (pyrouge) scores:
ROUGE-1 (F1): 39.24
ROUGE-2 (F1): 17.22
ROUGE-L (F1): 36.15