Code, data, and additional analysis for the paper "Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics":
https://www.aclweb.org/anthology/2020.acl-main.448/
Data: files with WMT19 system-level metric scores and human scores.
top-n: a PDF of figures comparing the top-n and rolling-window subsampling methods for all language pairs, as described in Section 4.1 of the paper.
Outliers: code to compute correlations with and without outliers, as described in Section 4.2 of the paper.
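
For orientation, the two analyses above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this repository: the function names (`pearson`, `remove_outliers`, `top_n`, `rolling_window`) and the toy scores are hypothetical, and the outlier rule shown (a median/MAD robust score with a cutoff of 2.5) is an assumption that should be checked against the paper and the Outliers code before relying on it.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    return float(np.corrcoef(x, y)[0, 1])


def remove_outliers(human, metric, cutoff=2.5):
    """Drop systems whose human score is a robust outlier.

    Assumed rule: |s - median| / (1.483 * MAD) > cutoff marks an outlier,
    where MAD is the median absolute deviation from the median.
    """
    human = np.asarray(human, dtype=float)
    metric = np.asarray(metric, dtype=float)
    med = np.median(human)
    mad = 1.483 * np.median(np.abs(human - med))
    if mad == 0:  # all scores (nearly) identical: nothing to remove
        return human, metric
    keep = np.abs(human - med) / mad <= cutoff
    return human[keep], metric[keep]


def top_n(human, metric, n):
    """Correlation computed over the n systems with the highest human scores."""
    human = np.asarray(human, dtype=float)
    metric = np.asarray(metric, dtype=float)
    idx = np.argsort(-human)[:n]
    return pearson(human[idx], metric[idx])


def rolling_window(human, metric, n):
    """Correlations over every window of n systems adjacent in human ranking."""
    human = np.asarray(human, dtype=float)
    metric = np.asarray(metric, dtype=float)
    order = np.argsort(-human)  # best system first
    return [
        pearson(human[order[i:i + n]], metric[order[i:i + n]])
        for i in range(len(order) - n + 1)
    ]


# Toy (made-up) system-level scores: the last system is far below the rest.
human_scores = [0.30, 0.25, 0.20, 0.15, 0.10, -1.50]
metric_scores = [31.0, 30.0, 29.5, 28.0, 27.0, 5.0]

kept_human, kept_metric = remove_outliers(human_scores, metric_scores)
r_all = pearson(human_scores, metric_scores)
r_no_outliers = pearson(kept_human, kept_metric)
windows = rolling_window(human_scores, metric_scores, n=4)
```

With real data, `human_scores` and `metric_scores` would be read from the files in Data for one language pair. The first rolling window is the same subset as top-n, so the two methods differ only in the windows further down the ranking.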