
Evaluation of MT metrics on the WMT-21 Metrics data for English→Russian.


Machine-Translation Evaluation: Comparing Traditional and Neural Machine-Translation Evaluation Metrics for English→Russian

Machine translation (MT) has become increasingly popular in recent years due to advances in technology and growing globalization. As the quality of MT continues to improve, more and more companies are turning to it instead of human translation to save time and money. However, the growing reliance on MT has also highlighted the need for automatic evaluation algorithms that can accurately measure its quality. Developing such algorithms is essential for ensuring that MT can effectively meet the needs of businesses and individuals in the global marketplace, as well as for comparing different MT systems against each other and tracking their improvements over time. MT evaluation metrics are an indispensable component of these automatic evaluation algorithms.

This repository is part of the thesis project for the Master's Degree in "Linguistics: Text Mining" at the Vrije Universiteit Amsterdam (2022-2023). The project focuses on replicating selected research conducted at the WMT21 Metrics Shared Task. The replication involves evaluating the traditional metrics (SacreBLEU, TER, and CHRF2) as well as the best-performing reference-based (BLEURT-20, COMET-MQM_2021) and reference-free (COMET-QE-MQM_2021) neural metrics. The evaluation is conducted across two domains: news articles and TED talks translated from English into Russian. By examining the performance of these metrics, we aim to understand their effectiveness and suitability in different translation contexts. Furthermore, the thesis project goes beyond the initial evaluation and explores the applicability of reference-free neural metrics, with a particular focus on COMET-QE-MQM_2021, for professional human translators. This extended evaluation is performed on a distinct domain, namely scientific articles, translated in the same direction (English→Russian) as the primary data.

Creator: Natalia Khaidanova

Supervisor: Sophie Arnoult

Content

\Data

The Data folder contains:

Files:

  • all_TED_data.tsv stores all source sentences, reference translations, and MTs presented at the WMT21 Metrics Task for the TED talks domain.

  • all_news_data.tsv stores all source sentences, reference translations, and MTs presented at the WMT21 Metrics Task for the news domain.

  • create_data_files.py creates the all_TED_data.tsv and all_news_data.tsv files and converts the WMT21 Metrics Task human judgments into .tsv files per judgment type (MQM, raw DA, and z-normalized DA) and domain (news and TED talks). The resulting files are stored in human_judgments_seg (segment-level human judgments) and human_judgments_sys (system-level human judgments).
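
For orientation, the snippet below shows one way to load one of these .tsv files with pandas. It is only a minimal sketch: it assumes pandas is installed and the repository root is the working directory, and the 'system' column referenced at the end is an illustrative guess, so check the actual header row of the files before reusing the code.

    # Minimal sketch: load one of the generated .tsv files and inspect it.
    # Assumes pandas is installed and the repository root is the working
    # directory; the 'system' column used below is an illustrative guess.
    import pandas as pd

    news = pd.read_csv("Data/all_news_data.tsv", sep="\t")

    print(news.shape)               # number of rows (segments) and columns
    print(news.columns.tolist())    # check the actual column names here

    if "system" in news.columns:    # e.g. count segments per MT system
        print(news["system"].value_counts())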

Subfolders:

\eval

The eval folder contains:

Files:

  • get_nr_annotations.py checks the number of annotated segments in the WMT21 Metrics Task data per type of human judgment (MQM, raw DA, or z-normalized DA).

  • seg_eval.py runs a segment-level evaluation of the implemented neural (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) and traditional (SacreBLEU, TER, and CHRF2) metrics.

  • sys_eval.py runs a system-level evaluation of the implemented neural (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) and traditional (SacreBLEU, TER, and CHRF2) metrics. A minimal scoring-and-correlation sketch follows the subfolder list below.

Subfolders:

  • human_judgments_seg stores segment-level human judgment scores of each type (MQM, raw DA, or z-normalized DA) in separate .tsv files. The scores are presented for both news and TED talks.

  • human_judgments_sys stores system-level human judgment scores of each type (MQM, raw DA, or z-normalized DA) in separate .tsv files. The scores are presented for both news and TED talks.
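
The sketch below is not seg_eval.py or sys_eval.py themselves, but a minimal illustration of the kind of computation they perform: scoring translations with a traditional metric via sacrebleu and correlating the scores with human judgments. The file names, column names, and choice of correlation statistics (Kendall's tau at the segment level, Pearson's r at the system level) are assumptions made for the example.

    # Minimal sketch (not the repository scripts themselves): score MT output
    # with a traditional metric and correlate the scores with human judgments.
    # Assumes sacrebleu, scipy, and pandas are installed; the file names and
    # column names below are illustrative placeholders.
    import pandas as pd
    import sacrebleu
    from scipy.stats import kendalltau, pearsonr

    data = pd.read_csv("Data/all_news_data.tsv", sep="\t")                # hypothetical layout
    mqm = pd.read_csv("eval/human_judgments_seg/news_MQM.tsv", sep="\t")  # hypothetical file name

    # Segment-level chrF2 scores (sacrebleu's default chrF configuration);
    # sacrebleu.sentence_bleu and sacrebleu.sentence_ter work analogously.
    chrf_scores = [
        sacrebleu.sentence_chrf(mt, [ref]).score
        for mt, ref in zip(data["mt"], data["reference"])
    ]

    # Segment-level correlation with MQM judgments (Kendall's tau); in practice
    # the metric scores and judgments are merged on system and segment IDs first.
    tau, _ = kendalltau(chrf_scores, mqm["score"])
    print(f"segment-level Kendall tau: {tau:.3f}")

    # System-level correlation: average metric and human scores per MT system,
    # then compare with Pearson's r (assumes both files cover the same systems).
    data["chrf"] = chrf_scores
    metric_by_system = data.groupby("system")["chrf"].mean()
    human_by_system = mqm.groupby("system")["score"].mean()
    r, _ = pearsonr(metric_by_system, human_by_system)
    print(f"system-level Pearson r: {r:.3f}")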

\metrics

The metrics folder contains:

Files:

\reference-free_eval

The reference-free_eval folder contains:

Files:

  • COMET-QE-MQM_2021.py computes segment- and system-level scores of the reference-free neural metric COMET-QE-MQM_2021 on the additional data comprising two scientific articles (Baby K and A Beautiful Mind). The metric evaluates both human and machine translations; a minimal usage sketch is given after the subfolder listing below. Note that the source sentences and their human translations were added to the files manually.

  • add_opus_mt_translations.py adds MTs produced by the opus-mt-en-ru MT system to the data comprising two scientific articles (Baby K and A Beautiful Mind).

  • get_mean_length.py computes the mean character length of the source sentences and their human translations in the Baby K and A Beautiful Mind articles.

Subfolders:

  • Data contains two scientific articles (Baby K and A Beautiful Mind), each comprising English source sentences, their corresponding Russian human translations, and MTs produced by the opus-mt-en-ru MT system. The files were created with the aim of evaluating the applicability of reference-free neural metrics, specifically COMET-QE-MQM_2021, for professional human translators. The subfolder also stores the segment- and system-level scores produced by COMET-QE-MQM_2021 for both human and machine translations.
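
The following is a minimal, hedged sketch of running a reference-free COMET model on source/translation pairs; it is not the repository's COMET-QE-MQM_2021.py. It assumes the unbabel-comet package is installed; the model identifier follows the COMET 1.x naming scheme and the example sentences are invented, so both may need to be adapted.

    # Minimal sketch (not the repository's COMET-QE-MQM_2021.py): score
    # source/translation pairs with a reference-free COMET model.
    # Assumes the unbabel-comet package; the model name follows the COMET 1.x
    # naming scheme and may differ in other versions.
    from comet import download_model, load_from_checkpoint

    model_path = download_model("wmt21-comet-qe-mqm")   # assumed identifier
    model = load_from_checkpoint(model_path)

    # Reference-free input: only the source and the (human or machine) translation.
    samples = [
        {"src": "The patient was discharged after two weeks.",
         "mt": "Пациента выписали через две недели."},
    ]

    # gpus=0 keeps the sketch runnable on CPU. Depending on the installed COMET
    # version, predict() returns either a (segment_scores, system_score) tuple
    # or an object with .scores and .system_score attributes.
    output = model.predict(samples, batch_size=8, gpus=0)
    print(output)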

requirements.txt

The requirements.txt file contains information about the packages and models required to run and evaluate the implemented traditional (SacreBLEU, TER, and CHRF2) and neural (BLEURT-20, COMET-MQM_2021, and COMET-QE-MQM_2021) metrics. It also lists additional packages needed to run all the .py files in the repository.

Natalia_Khaidanova_Thesis.pdf

The Natalia_Khaidanova_Thesis.pdf file contains the thesis report outlining the results of the research.

References