Unsupervised-Metrics is a Python library that allows researchers and developers alike to experiment with state-of-the-art evaluation metrics for machine translation. The focus lies on reference-free, unsupervised metrics, which do not make use of supervision (parallel data, references, human scores) in any way. However, wrappers around some (weakly-)supervised metrics like XMoverScore and SentSim are provided for convenience.
If you want to use this project as a library, you can install it as a regular package with pip:
pip install 'git+https://github.com/potamides/unsupervised-metrics.git#egg=metrics'
If your goal is to run the included experiments (e.g., to replicate the results of UScore), clone the repository and install it in editable mode:
git clone https://github.com/potamides/unsupervised-metrics
pip install -e unsupervised-metrics[experiments]
If you want to use fast-align, follow its install instructions and make sure that the fast_align and atools programs are on your PATH. This requirement is optional.
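If you are unsure whether both programs are visible, a quick check like the following (a minimal sketch, not part of the library) prints where they are found:

from shutil import which

# fast_align and atools are only needed for the optional fast-align alignment step
for tool in ("fast_align", "atools"):
    path = which(tool)
    print(f"{tool}: {path if path else 'not found on PATH'}")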
One focus of this library is to make it easy to fine-tune existing state-of-the-art metrics for arbitrary language pairs and domains. A simple example is provided in the code block below. For more involved examples, and for details on how to instantiate a pre-trained metric, take a look at the experiments.
from metrics.contrastscore import ContrastScore
from metrics.utils.dataset import DatasetLoader
src_lang, tgt_lang = "de", "en"
dataset = DatasetLoader(src_lang, tgt_lang)
# instantiate ContrastScore and enable parallel training on multiple GPUs
scorer = ContrastScore(source_language=src_lang, target_language=tgt_lang, parallelize=True)
# train the underlying language model on pseudo-parallel sentence pairs
scorer.train(*dataset.load("monolingual-train"))
# print correlations with human judgments
print("Pearson's r: {}, Spearman's ρ: {}".format(*scorer.correlation(*dataset.load("scored"))))
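Once trained, the metric can also score translations directly. The snippet below is a small sketch with made-up sentence lists; it assumes that ContrastScore exposes the score() method of the CommonScore interface described further down:

# hypothetical source sentences and candidate translations, aligned by index
source_sentences = ["Der Hund bellt.", "Die Katze schläft auf dem Sofa."]
translations = ["The dog barks.", "The cat is sleeping on the couch."]
# score() returns one similarity score per sentence pair
print(scorer.score(source_sentences, translations))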
This library can also be used as a framework to create new metrics, as demonstrated in the code block below. Existing metrics are defined in the metrics package, which could serve as a source of inspiration.
from metrics.common import CommonScore
class MyOwnMetric(CommonScore):
    def align(self, source_sents, target_sents):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters and returns a
        list of pseudo-aligned sentence pairs.
        """

    def _embed(self, source_sents, target_sents):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters and returns
        their embeddings, inverse document frequencies, tokens, and padding
        masks.
        """

    def score(self, source_sents, target_sents):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters, which are
        assumed to be aligned according to their index. For each sentence pair
        a similarity score is computed and the list of scores is returned.
        """
This library is based on the following projects:
- ACL20-Reference-Free-MT-Evaluation
- Unsupervised-crosslingual-Compound-Method-For-MT
- Seq2Seq examples of transformers
- VecMap
- CRISS
If you like/use our work, please cite as follows:
@inproceedings{belouadi-eger-2023-uscore,
title = "{US}core: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation",
author = "Belouadi, Jonas and
Eger, Steffen",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.27",
pages = "358--374",
}