This is a simple toolkit to evaluate the performance of Machine Translation metrics on standard test sets from the WMT Metrics Shared Tasks. It bundles all data relevant to metric development and evaluation for a given test set and language pair, and lets you do the following:
- List the names of documents and MT systems in the set.
- Get source and reference texts, and MT system outputs.
- Get stored human and metric scores for MT system outputs.
- Compute correlations between human and metric scores, at system, document, and segment granularities.
- Perform significance tests on correlations.
These operations can be performed from the command line using the mtme script, or programmatically via the Python API.
You need python3. To install:
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
Start by downloading the database (this is also required before using the API):
alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download # Puts ~1G of data into $HOME/.mt-metrics-eval.
mtme --help # List options.
Optionally test the install:
python3 -m unittest mt_metrics_eval.data_test # Takes about 30 seconds.
python3 -m unittest mt_metrics_eval.stats_test
Get information about test sets:
mtme --list # List available test sets.
mtme --list -t wmt20 # List language pairs for wmt20.
mtme --list -t wmt20 -l en-de # List details for wmt20 en-de.
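The same information is available from the Python API. A minimal sketch, using only attributes that also appear in the examples later in this README (evs.sys_outputs maps system names to their outputs, evs.docs maps document names to segment ranges):
from mt_metrics_eval import data
evs = data.EvalSet('wmt20', 'en-de')
print(sorted(evs.sys_outputs))  # MT system names.
print(sorted(evs.docs))  # Document names.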
Get the contents of a test set. The following pastes doc-id, source, standard-reference, and alternative-reference columns to stdout:
mtme -t wmt20 -l en-de --echo doc,src,ref,refb
Print outputs from all systems, one system after another, each pasted with doc-ids, source, and reference:
mtme -t wmt20 -l en-de --echosys doc,src,ref
Evaluate metric score files containing tab-separated "system-name score"
entries. For system-level correlations, supply one score per system. For
document-level or segment-level correlations, supply one score per document or
segment, grouped by system, in the same order as text generated using --echo
(the same order as the WMT test-set file). Granularity is determined
automatically:
examples=$HOME/.mt-metrics-eval/mt-metrics-eval/wmt20/metric-scores/en-de
mtme -t wmt20 -l en-de < $examples/sentBLEU.sys.score
mtme -t wmt20 -l en-de < $examples/sentBLEU.doc.score
mtme -t wmt20 -l en-de < $examples/sentBLEU.seg.score
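Score files are plain text in the tab-separated format described above. A minimal sketch of writing a system-level score file from Python (the system names and values here are hypothetical, for illustration only):
# Hypothetical system names and metric values, for illustration only.
my_scores = {'hypothetical-system-A': 0.31, 'hypothetical-system-B': 0.27}
with open('MyMetric.sys.score', 'w') as f:
  for sys_name, score in my_scores.items():
    f.write(f'{sys_name}\t{score}\n')
Document- and segment-level files use the same format, with one line per document or segment, grouped by system in --echo order.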
Compare to MQM gold scores instead of WMT gold scores:
mtme -t wmt20 -l en-de -g mqm < $examples/sentBLEU.sys.score
Generate correlations for two metrics files, and perform tests to determine whether they are significantly different:
mtme -t wmt20 -l en-de -i $examples/sentBLEU.sys.score -c $examples/COMET.sys.score
The scripts score_mqm and score_sqm can be used to convert MQM and SQM annotations from Google's human annotation data for WMT20 into score files in mt-metrics-eval format. For example:
wget https://github.com/google/wmt-mqm-human-evaluation/raw/main/ende/mqm_newstest2020_ende.tsv
python3 -m mt_metrics_eval.score_mqm < mqm_newstest2020_ende.tsv > mqm.seg.score
diff -s mqm.seg.score $HOME/.mt-metrics-eval/mt-metrics-eval/wmt20/human-scores/en-de.mqm.seg.score
# Files should be identical.
Other options to these scripts let you explore different error weightings or extract scores from individual annotators.
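If you want to work with score files directly in Python, the following sketch (not part of the toolkit) reads a segment-level score file into the {system-name: [scores]} dictionary shape used by the API examples below:
import collections
def ReadSegScores(filename):
  """Read tab-separated 'system-name score' lines into a dict of score lists."""
  scores = collections.defaultdict(list)
  with open(filename) as f:
    for line in f:
      sys_name, score = line.rstrip('\n').split('\t')
      scores[sys_name].append(float(score))
  return dict(scores)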
WMT correlations for a new metric, all granularities:
from mt_metrics_eval import data
def MyMetric(out, ref):
  """Return a scalar score for given output/reference texts."""
  return -sum(abs(len(a) - len(b)) for a, b in zip(out, ref))
evs = data.EvalSet('wmt20', 'en-de')
sys_scores, doc_scores, seg_scores = {}, {}, {}
ref = evs.ref
for s, out in evs.sys_outputs.items():
  sys_scores[s] = [MyMetric(out, ref)]
  doc_scores[s] = [MyMetric(out[b:e], ref[b:e]) for b, e in evs.docs.values()]
  seg_scores[s] = [MyMetric([o], [r]) for o, r in zip(out, ref)]
# Official WMT correlations.
print('sys Pearson:', evs.Correlation('sys', sys_scores).Pearson()[0])
print('doc Kendall:', evs.Correlation('doc', doc_scores).KendallLike()[0])
print('seg Kendall:', evs.Correlation('seg', seg_scores).KendallLike()[0])
Correlations using alternative gold scores:
from mt_metrics_eval import data
def MyMetric(out, ref):
  """Return a scalar score for given output/reference texts."""
  return -sum(abs(len(a) - len(b)) for a, b in zip(out, ref))
evs = data.EvalSet('wmt20', 'en-de')
scores = {s: [MyMetric(out, evs.ref)] for s, out in evs.sys_outputs.items()}
print('MQM Pearson:', evs.Correlation('sys', scores, 'mqm').Pearson()[0])
print('pSQM Pearson:', evs.Correlation('sys', scores, 'psqm').Pearson()[0])
print('cSQM Pearson:', evs.Correlation('sys', scores, 'csqm').Pearson()[0])
# The correlations above aren't comparable to WMT, since the *qm gold scores are
# available only for a subset of system outputs. To get comparable WMT scores,
# we need to explicitly specify these systems.
qm_sys = set(evs.Scores('sys', 'mqm')) - evs.human_sys_names
print('WMT Pearson:', evs.Correlation('sys', scores, sys_names=qm_sys).Pearson()[0])
New correlations for a stored metric:
from mt_metrics_eval import data
# Eval set will load more slowly, since we're reading in all stored metrics.
evs = data.EvalSet('wmt20', 'en-de', read_stored_metric_scores=True)
scores = evs.Scores('sys', scorer='BLEURT-extended')
qm_sys = set(evs.Scores('sys', 'mqm')) - evs.human_sys_names
qm_sys_bp = qm_sys | {'Human-B.0', 'Human-P.0'}
wmt = evs.Correlation('sys', scores, sys_names=qm_sys)
mqm = evs.Correlation('sys', scores, gold_scorer='mqm', sys_names=qm_sys)
wmt_bp = evs.Correlation('sys', scores, sys_names=qm_sys_bp)
mqm_bp = evs.Correlation('sys', scores, gold_scorer='mqm', sys_names=qm_sys_bp)
print('BLEURT-ext WMT Pearson for qm systems:', wmt.Pearson()[0])
print('BLEURT-ext MQM Pearson for qm systems:', mqm.Pearson()[0])
print('BLEURT-ext WMT Pearson for qm systems plus human:', wmt_bp.Pearson()[0])
print('BLEURT-ext MQM Pearson for qm systems plus human:', mqm_bp.Pearson()[0])
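The same pattern should extend to other granularities. For example, assuming segment-level BLEURT-extended metric scores and segment-level MQM gold scores are stored for this test set (the en-de.mqm.seg.score file above suggests the latter), a sketch continuing from the code above:
seg_scores = evs.Scores('seg', scorer='BLEURT-extended')
mqm_seg = evs.Correlation('seg', seg_scores, gold_scorer='mqm', sys_names=qm_sys)
print('BLEURT-ext MQM seg Kendall for qm systems:', mqm_seg.KendallLike()[0])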
Inspired by and loosely modeled on SacreBLEU.
This is not an official Google product.