
Library for experimenting with state-of-the-art evaluation metrics like UScore

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0)

Unsupervised Metrics: UScore & Friends

Unsupervised-Metrics is a Python library that lets researchers and developers experiment with state-of-the-art evaluation metrics for machine translation. The focus is on reference-free, unsupervised metrics, which use no form of supervision (parallel data, references, or human scores). However, wrappers around some (weakly) supervised metrics, such as XMoverScore and SentSim, are provided for convenience.

Implemented Papers

Installation

If you want to use this project as a library, you can install it as a regular package with pip:

pip install 'git+https://github.com/potamides/unsupervised-metrics.git#egg=metrics'

If your goal is to run the included experiments (e.g., to replicate the results of UScore), clone the repository and install it in editable mode:

git clone https://github.com/potamides/unsupervised-metrics
pip install -e 'unsupervised-metrics[experiments]'

If you want to use fast-align, follow its installation instructions and make sure that the fast_align and atools programs are on your PATH. This dependency is optional.
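
To confirm that both programs can be found before starting a longer run, a small check along the following lines may help. This is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of this library.

```python
import shutil


def missing_tools(tools=("fast_align", "atools")):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("fast_align and atools are available.")
```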

Usage

Train an existing metric

One focus of this library is to make it easy to fine-tune existing state-of-the-art metrics for arbitrary language pairs and domains. A simple example is provided in the code block below. For more involved examples, and for ways to instantiate a pre-trained metric, take a look at the experiments.

from metrics.contrastscore import ContrastScore
from metrics.utils.dataset import DatasetLoader

src_lang, tgt_lang = "de", "en"

dataset = DatasetLoader(src_lang, tgt_lang)
# instantiate ContrastScore and enable parallel training on multiple GPUs
scorer = ContrastScore(source_language=src_lang, target_language=tgt_lang, parallelize=True)
# train the underlying language model on pseudo-parallel sentence pairs
scorer.train(*dataset.load("monolingual-train"))

# print correlations with human judgments
print("Pearson's r: {}, Spearman's ρ: {}".format(*scorer.correlation(*dataset.load("scored"))))
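
For readers unfamiliar with the two statistics printed above, the sketch below computes Pearson's r and Spearman's ρ from their textbook definitions in plain Python. It does not show how `scorer.correlation` is implemented (the source does not say); the function names and the sample scores are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r: covariance of xs and ys divided by the product of their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks of xs and ys."""
    return pearson(ranks(xs), ranks(ys))


# Made-up numbers: metric scores and human judgments for four translations.
metric_scores = [0.12, 0.48, 0.35, 0.90]
human_scores = [1.0, 3.5, 2.0, 4.5]
print("Pearson's r: {:.3f}, Spearman's ρ: {:.3f}".format(
    pearson(metric_scores, human_scores), spearman(metric_scores, human_scores)))
```

Pearson's r rewards a linear relationship between metric and human scores, whereas Spearman's ρ only asks that both rank the translations in the same order.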

Create your own metric

This library can also be used as a framework to create new metrics, as demonstrated in the code block below. Existing metrics are defined in the metrics package, which could serve as a source of inspiration.

from metrics.common import CommonScore

class MyOwnMetric(CommonScore):
    # NOTE: the parameter names below are illustrative; check CommonScore for
    # the exact signatures expected by the base class.
    def align(self, source_sents, target_sents):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters and returns
        a list of pseudo-aligned sentence pairs.
        """

    def _embed(self, source_sents, target_sents):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters and returns
        their embeddings, inverse document frequencies, tokens and padding
        masks.
        """

    def score(self, source_sents, target_sents):
        """
        This method receives a list of sentences in the source language and a
        list of sentences in the target language as parameters, which are
        assumed to be aligned according to their index. For each sentence pair
        a similarity score is computed and the list of scores is returned.
        """

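To make the division of labour between the three methods more concrete, here is a self-contained toy sketch. It deliberately does not import this library: `ToyBase` is a hypothetical stand-in for `CommonScore`, the character-trigram "embeddings" replace a real language model, and `_embed` returns only count vectors rather than the embeddings, inverse document frequencies, tokens and padding masks described above. Treat it as an illustration of the idea, not as the library's API.

```python
from abc import ABC, abstractmethod
from collections import Counter
from math import sqrt


class ToyBase(ABC):
    """Hypothetical stand-in for CommonScore; NOT the library's class."""

    @abstractmethod
    def align(self, source_sents, target_sents): ...

    @abstractmethod
    def _embed(self, source_sents, target_sents): ...

    @abstractmethod
    def score(self, source_sents, target_sents): ...


def _cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TrigramOverlapMetric(ToyBase):
    def _embed(self, source_sents, target_sents):
        # Toy "embedding": counts of lowercase character trigrams.
        def trigrams(sent):
            text = sent.lower()
            return Counter(text[i:i + 3] for i in range(len(text) - 2))

        return [trigrams(s) for s in source_sents], [trigrams(t) for t in target_sents]

    def score(self, source_sents, target_sents):
        # Sentences are assumed to be aligned by index.
        src, tgt = self._embed(source_sents, target_sents)
        return [_cosine(s, t) for s, t in zip(src, tgt)]

    def align(self, source_sents, target_sents):
        # Pair every source sentence with its most similar target sentence.
        src, tgt = self._embed(source_sents, target_sents)
        pairs = []
        for sent, vec in zip(source_sents, src):
            best = max(range(len(tgt)), key=lambda j: _cosine(vec, tgt[j]))
            pairs.append((sent, target_sents[best]))
        return pairs


metric = TrigramOverlapMetric()
print(metric.align(["Berlin ist gross", "Hamburg Hafen"],
                   ["The Hamburg harbour", "Berlin is big"]))
```

In this sketch, `align` and `score` both build on `_embed`, which mirrors the structure suggested by the docstrings above: embedding is the shared primitive, alignment mines pseudo-parallel pairs from unaligned text, and scoring compares sentences that are already paired.
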
Acknowledgments

This library is based on the following projects:

Citation

If you like/use our work, please cite as follows:

@inproceedings{belouadi-eger-2023-uscore,
    title = "{US}core: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation",
    author = "Belouadi, Jonas  and
      Eger, Steffen",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.27",
    pages = "358--374",
}