MTME is a simple toolkit to evaluate the performance of Machine Translation metrics on standard test sets from the WMT Metrics Shared Tasks. It bundles data relevant to metric development and evaluation for a given test set and language pair, and lets you do the following:
- Access source, reference, and MT output text, along with associated meta-info, for the WMT metrics tasks from 2019-2023. This can be done programmatically or by directly accessing the files, which are stored in a straightforward directory structure and format.
- Access human and automatic metric scores for the above data.
- Reproduce the official results from the WMT metrics tasks. For WMT22, there is a colab to do this; other years require a bit more work.
- Compute various correlations and perform significance tests on correlation differences between two metrics.
These can be done on the command line using a python script, or from an API.
You need Python 3.9 or later. To install:
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
The data must also be downloaded before using the toolkit. You can either use the mtme script:
alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download # Puts ~2G of data into $HOME/.mt-metrics-eval.
Or download directly, if you're only interested in the data:
mkdir $HOME/.mt-metrics-eval
cd $HOME/.mt-metrics-eval
wget https://storage.googleapis.com/mt-metrics-eval/mt-metrics-eval-v2.tgz
tar xfz mt-metrics-eval-v2.tgz
Once data is downloaded, you can optionally test the install:
python3 -m unittest mt_metrics_eval.stats_test
python3 -m unittest mt_metrics_eval.data_test # Takes about 30 seconds.
python3 -m unittest mt_metrics_eval.tasks_test # Takes about 30 seconds.
Here are some examples of things you can do with the mtme script. They assume that the mtme alias above has been set up.
Get information about test sets:
mtme --list # List available test sets.
mtme --list -t wmt22 # List language pairs for wmt22.
mtme --list -t wmt22 -l en-de # List details for wmt22 en-de.
Get contents of test sets. Paste doc-id, source, standard reference, alternative reference to stdout:
mtme -t wmt22 -l en-de --echo doc,src,refA,refB
Outputs from all systems, sequentially, pasted with doc-ids, source, and reference:
mtme -t wmt22 -l en-de --echosys doc,src,refA
Human and metric scores for all systems, at all granularities:
mtme -t wmt22 -l en-de --scores > wmt22.en-de.tsv
Evaluate metric score files containing tab-separated system-name score
entries. For system-level correlations, supply one score per system. For
document-level or segment-level correlations, supply one score per document or
segment, grouped by system, in the same order as text generated using --echo
(the same order as the WMT test-set file). Granularity is determined
automatically. Domain-level scores are currently not supported by
this command.
examples=$HOME/.mt-metrics-eval/mt-metrics-eval-v2/wmt22/metric-scores/en-de
mtme -t wmt22 -l en-de < $examples/BLEU-refA.sys.score
mtme -t wmt22 -l en-de < $examples/BLEU-refA.seg.score
Compare to WMT appraise gold scores instead of MQM gold scores:
mtme -t wmt22 -l en-de -g wmt-appraise < $examples/BLEU-refA.sys.score
mtme -t wmt22 -l en-de -g wmt-appraise < $examples/BLEU-refA.seg.score
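If you want to evaluate your own metric this way, the score file can be written with a few lines of plain Python. This is a minimal sketch: the system names and scores are toy placeholders (a real file must contain one score for every segment of every system), and the filename simply follows the NAME-REF.LEVEL.score convention described later.

# Sketch: write a segment-level metric score file in the format mtme expects.
# Each line is "SYSTEM<tab>SCORE"; within each system, segments must appear in
# the same order as the text produced by --echo. Names and scores below are
# toy placeholders.
my_seg_scores = {
    'systemA': [0.91, 0.52, 0.77],  # one score per segment
    'systemB': [0.84, 0.61, 0.70],
}
with open('MyMetric-refA.seg.score', 'w') as f:
  for sys_name, seg_scores in my_seg_scores.items():
    for score in seg_scores:
      f.write(f'{sys_name}\t{score}\n')

The resulting file can then be fed to mtme exactly like the BLEU examples above.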
Compute correlations for two metric score files, and perform significance tests to determine whether their correlations differ:
mtme -t wmt22 -l en-de -i $examples/BLEU-refA.sys.score -c $examples/COMET-22-refA.sys.score
Compare all known metrics under specified conditions. This corresponds to one of the "tasks" in the WMT22 metrics evaluation. The first output line contains all relevant parameter settings, and subsequent lines show metrics in descending order of performance, followed by the rank of their significance cluster, the value of the selected correlation statistic, and a vector of flags to indicate significant differences with lower-ranked metrics. These examples use k_block=5 for demo purposes; using k_block=100 will approximately match official results but can take minutes to hours to complete, depending on the task.
# System-level Pearson
mtme -t wmt22 -l en-de --matrix --k_block 5
# System-level pairwise accuracy, pooling results across all MQM languages
mtme -t wmt22 -l en-de,zh-en,en-ru --matrix \
--matrix_corr accuracy --k_block 5
# Segment-level item-wise averaged Kendall-Tau-Acc23 with optimal tie threshold,
# using a sample rate of 1.0; --k 0 disables significance testing for the demo.
mtme -t wmt22 -l en-de --matrix --matrix_level seg --avg item \
--matrix_corr KendallWithTiesOpt --matrix_perm_test pairs \
--matrix_corr_args "{'variant':'acc23', 'sample_rate':1.0}" --k 0
The colab notebook mt_metrics_eval.ipynb
contains examples that show how to
use the API to load and summarize data, and compare stored metrics (ones that
participated in the metrics shared tasks) using different criteria. It also
demonstrates how you can incorporate new metrics into these comparisons.
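If you prefer working directly in Python, the core flow looks roughly like the sketch below. It assumes the data has already been downloaded; the names EvalSet, Scores, StdHumanScoreName, and Correlation follow the API used in the colab, but treat this as a sketch and consult the notebook for exact signatures. The metric here is a toy placeholder.

from mt_metrics_eval import data

evs = data.EvalSet('wmt22', 'en-de')  # reads from $HOME/.mt-metrics-eval

ref = evs.all_refs[evs.std_ref]       # standard reference segments
outputs = evs.sys_outputs             # dict: system name -> output segments

# A toy segment-level "metric": negated length difference to the reference.
def toy_metric(out, ref):
  return [-abs(len(o) - len(r)) for o, r in zip(out, ref)]

# System-level scores are just averages of the segment scores here.
sys_scores = {
    name: [sum(toy_metric(out, ref)) / len(ref)]
    for name, out in outputs.items()
}

# Correlate with the system-level gold scores (MQM for wmt22 en-de).
gold = evs.Scores('sys', evs.StdHumanScoreName('sys'))
corr = evs.Correlation(gold, sys_scores, list(gold))
print('Pearson:', corr.Pearson()[0])  # Pearson() returns (value, p-value)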
The notebooks wmt22_metrics.ipynb
and wmt23_metrics.ipynb
document how the
official results for these tasks were generated.
We will try to provide similar notebooks for future evaluations.
The notebook ties_matter.ipynb
contains the code to reproduce the results
from Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration.
It also contains examples for how to calculate the proposed pairwise accuracy
with tie calibration.
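As a conceptual illustration (not the notebook's implementation), pairwise accuracy with a tie threshold can be sketched as follows: metric differences smaller than an epsilon count as ties, and tie calibration simply searches for the epsilon that maximizes the accuracy.

import itertools

def pairwise_accuracy(metric, gold, epsilon):
  """Fraction of item pairs where the metric's ordering agrees with gold.

  Metric differences below epsilon are treated as ties; gold ties are taken
  at face value. A pair counts as correct if both sides tie, or both order
  the pair the same way.
  """
  correct, total = 0, 0
  for (m1, g1), (m2, g2) in itertools.combinations(zip(metric, gold), 2):
    m_cmp = 0 if abs(m1 - m2) < epsilon else (1 if m1 > m2 else -1)
    g_cmp = 0 if g1 == g2 else (1 if g1 > g2 else -1)
    correct += int(m_cmp == g_cmp)
    total += 1
  return correct / total if total else 0.0

# Toy data: tie calibration picks the epsilon that maximizes accuracy.
metric = [0.30, 0.31, 0.80, 0.50]
gold = [1.0, 1.0, 3.0, 2.0]
best_eps = max([0.0, 0.01, 0.05, 0.1],
               key=lambda e: pairwise_accuracy(metric, gold, e))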
The scripts score_mqm
and score_sqm
can be used to convert MQM and SQM
annotations from Google's MQM annotation data into score files in
mt-metrics-eval format. For example:
git clone https://github.com/google/wmt-mqm-human-evaluation
python3 -m mt_metrics_eval.score_mqm \
--weights "major:5 minor:1 No-error:0 minor/Fluency/Punctuation:0.1" \
< wmt-mqm-human-evaluation/generalMT2022/ende/mqm_generalMT2022_ende.tsv \
> mqm.ende.seg.score
This produces an intermediate format with a single score per segment that matches the scores in MTME; the file contains extra columns with rater id and other info.
Other options let you explore different error weightings or extract scores from individual annotators.
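To make the weighting scheme concrete, here is a hypothetical helper (not part of score_mqm) showing how a weight string like the one above turns one segment's error annotations into a single score. The sign convention is an assumption: MTME score files treat higher as better, so the error penalty is negated here; check the stored mqm files to confirm.

# Hypothetical illustration of how MQM error weights map to a segment score.
# The negation assumes the "higher is better" convention used by MTME score
# files; confirm against the stored mqm files before relying on it.
def mqm_segment_score(errors, weights):
  """errors: list of (severity, category) pairs for one segment."""
  penalty = 0.0
  for severity, category in errors:
    # A specific key like 'minor/Fluency/Punctuation' overrides the bare
    # severity weight.
    penalty += weights.get(f'{severity}/{category}',
                           weights.get(severity, 0.0))
  return -penalty

weights = {'major': 5, 'minor': 1, 'No-error': 0,
           'minor/Fluency/Punctuation': 0.1}
print(mqm_segment_score([('major', 'Accuracy/Mistranslation'),
                         ('minor', 'Fluency/Punctuation')], weights))  # -5.1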
There is one top-level directory for each test set (e.g. wmt22). Each top-level directory contains the following sub-directories (whose contents should be obvious from their names): documents, human-scores, metric-scores, references, sources, and system-outputs.
In general, a test-set contains data from many language pairs. Each combination of test-set and language pair (e.g. wmt22 + de-en) is called an EvalSet. This is the main unit of computation in the toolkit. Each EvalSet consists of a source text (divided into one or more documents, optionally with domain membership), reference translations, system outputs to be scored, human gold scores, and metric scores.
Meta information is encoded into directory and file names as specified below. The convention is intended to be straightforward, but there are a few subtleties:
- Reference translations can be scored as system outputs. When this is the case, the reference files should be copied into the system-outputs directory with matching names. For example:
references/de-en.refb.txt → system-outputs/de-en/refb.txt
- Metrics can come in different variants according to which reference(s) they used. This information is encoded into their filenames. To facilitate parsing, reference names can't contain dashes or dots, as outlined below.
- Metric files must contain scores for all files in the system output directory, except those that were used as references.
- Human score files don’t have to contain entries for all systems, or even for all segments for a given system. Missing entries are marked with ‘None’ strings.
The filename format and content specification for each kind of file are described below. Paths are relative to the top-level directory corresponding to a test set, e.g. wmt20. SRC and TGT designate abbreviations for the source and target language, e.g. ‘en’. Blanks designate any amount of whitespace.
- source text:
  - filename: sources/SRC-TGT.txt
  - per-line contents: text segment
- document meta-info:
  - filename: documents/SRC-TGT.docs
  - per-line contents: DOMAIN DOCNAME
    - lines match those in the source file
    - documents are assumed to be contiguous blocks of segments
    - DOMAIN tags can be repurposed for categories other than domain, but each document must belong to only one category
- references:
  - filename: references/SRC-TGT.NAME.txt
    - NAME is the name of this reference, e.g. refb. Names cannot be the reserved strings all or src, or contain . or - characters.
  - per-line contents: text segment
    - lines match those in the source file
- system outputs:
  - filename: system-outputs/SRC-TGT/NAME.txt
    - NAME is the name of an MT system or reference
  - per-line contents: text segment
    - lines match those in the source file
- human scores:
  - filename: human-scores/SRC-TGT.NAME.LEVEL.score
    - NAME describes the scoring method, e.g. mqm or wmt-z.
    - LEVEL indicates the granularity of the scores, one of sys, domain, doc, or seg.
  - per-line contents: [DOMAIN] SYSNAME SCORE
    - DOMAIN is present only if granularity is domain
    - SYSNAME must match a NAME in system outputs
    - SCORE may be None to indicate a missing score
    - System-level (sys) files contain exactly one score per system.
    - Domain-level (domain) files contain one score per domain and system.
    - Document-level (doc) files contain a block of scores for each system. Each block contains the scores for successive documents, in the same order they occur in the document info file.
    - Segment-level (seg) files contain a block of scores for each system. Each block contains the scores for all segments in the system output file, in order.
- metric scores:
  - filename: metric-scores/SRC-TGT/NAME-REF.LEVEL.score
    - NAME is the metric's base name.
    - REF describes the reference(s) used for this version of the metric, either:
      - a list of one or more names separated by ., e.g. refa or refa.refb
      - the special string src to indicate that no reference was used
      - the special string all to indicate that all references were used
    - LEVEL indicates the granularity of the scores, one of sys, domain, doc, or seg.
  - per-line contents: [DOMAIN] SYSNAME SCORE
    - Format is identical to human scores, except that None entries aren't permitted.
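To make the score-file specification concrete, here is a minimal reader (a hypothetical helper, not part of the toolkit) that follows the format above; the path at the end is illustrative.

# Sketch of a reader for LEVEL.score files following the specification above.
def read_score_file(path, level):
  """Returns {sys_name: [scores]}, with None for missing human scores."""
  scores = {}
  with open(path) as f:
    for line in f:
      fields = line.split()            # blanks = any amount of whitespace
      if level == 'domain':
        domain, sys_name, score = fields
      else:
        sys_name, score = fields
      value = None if score == 'None' else float(score)
      scores.setdefault(sys_name, []).append(value)
  return scores

# Illustrative path, relative to $HOME/.mt-metrics-eval/mt-metrics-eval-v2.
mqm = read_score_file('wmt22/human-scores/en-de.mqm.seg.score', 'seg')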
Inspired by and loosely modeled on SacreBLEU.
This is not an official Google product.