This repository contains the dataset for the chat shared task organized with WMT 2024.
The dataset is provided in CSV format. Each row specifies the source language, target language, source segment, reference translation, document ID, sender information, and client ID.
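As a minimal sketch of how rows in this layout could be read, the snippet below parses a small in-memory sample with Python's standard `csv` module and counts segments per translation direction. The column names (`source_language`, `target_language`, `doc_id`, `sender`, `client_id`, and so on) and the sample rows are assumptions made for illustration; check the header row of the released files for the actual names.

```python
import csv
import io
from collections import Counter

# Hypothetical header and rows, for illustration only.
# The released files may use different column names.
SAMPLE = """source_language,target_language,source,reference,doc_id,sender,client_id
en,de,Hello,Hallo,doc1,agent,client_a
de,en,Danke,Thanks,doc1,customer,client_a
en,de,How can I help?,Wie kann ich helfen?,doc2,agent,client_b
"""


def load_rows(handle):
    """Read every CSV row into a dict keyed by the header names."""
    return list(csv.DictReader(handle))


def count_directions(rows):
    """Count segments per (source language, target language) direction."""
    return Counter((r["source_language"], r["target_language"]) for r in rows)


rows = load_rows(io.StringIO(SAMPLE))
directions = count_directions(rows)
```

To read a real file instead of the sample, pass `open(path, newline="", encoding="utf-8")` to `load_rows`.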
Table 1: Number of source segments in the released dataset.
language pair | train | valid |
---|---|---|
EN <-> DE | 17805 | 2569 |
EN <-> FR | 15027 | 3007 |
EN <-> PT-BR | 15092 | 2550 |
EN <-> KO | 16122 | 1935 |
EN <-> NL | 15463 | 2549 |
Our baseline system uses the NLLB-3.3B model to generate translations. Translation quality, as measured by automatic metrics, is reported below:
From English -> XX
language pair | chrF | COMET |
---|---|---|
EN -> DE | 66.24 | 86.76 |
EN -> FR | 74.31 | 88.85 |
EN -> PT-BR | 60.68 | 87.38 |
EN -> KO | 30.47 | 84.59 |
EN -> NL | 60.37 | 87.29 |
From XX -> English
language pair | chrF | COMET |
---|---|---|
DE -> EN | 65.92 | 85.88 |
FR -> EN | 72.53 | 85.44 |
PT-BR -> EN | 67.13 | 86.83 |
KO -> EN | 54.47 | 82.81 |
NL -> EN | 66.90 | 86.58 |
We additionally release a scoring script that computes the automatic metrics as described on the shared task page.
To use the scoring script, install the following libraries in this order:
- Install MuDA (Fernandes et al., 2021) and its package requirements:
git clone https://github.com/CoderPat/MuDA.git
pip install allennlp==2.10.0 sacremoses==0.0.53 spacy==3.3.0 spacy_stanza==1.0.2
- Set the MuDA path:
export MUDA_HOME=<path_to_muda>
- Install SacreBLEU (Post, 2018):
pip install sacrebleu
- Install COMET (Rei et al., 2020):
pip install git+https://github.com/Unbabel/COMET.git
Usage:
for lp in en-de en-fr en-pt en-ko en-nl; do
python run_automatic_eval.py --input_csv valid/${lp}.csv --hypothesis_file valid/${lp}.baseline.txt --tgt-lang ${lp: -2}
done
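The shell loop above derives the target language from the last two characters of each language pair (`${lp: -2}` in bash). A rough Python equivalent that only builds and prints the same command lines, without running them, might look like the sketch below. The script name, flags, and file paths are copied from the usage example; the helper `build_command` is a hypothetical name introduced here.

```python
import shlex

LANGUAGE_PAIRS = ["en-de", "en-fr", "en-pt", "en-ko", "en-nl"]


def build_command(lp):
    """Return the argument list the shell loop would run for one language pair."""
    tgt_lang = lp[-2:]  # same as ${lp: -2} in bash: the last two characters
    return [
        "python", "run_automatic_eval.py",
        "--input_csv", f"valid/{lp}.csv",
        "--hypothesis_file", f"valid/{lp}.baseline.txt",
        "--tgt-lang", tgt_lang,
    ]


commands = [build_command(lp) for lp in LANGUAGE_PAIRS]
for cmd in commands:
    # Print a shell-safe rendering of each command.
    print(shlex.join(cmd))
```

To actually run the evaluation from Python, each argument list could be passed to `subprocess.run(cmd, check=True)`.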
If you are participating in the task, make sure to register your team, along with the language pairs you intend to participate in, using this registration form.
Please note that all data released for the WMT24 Chat Translation task is licensed under CC-BY-NC-4.0 and can be freely used for research purposes only. As the license states, no commercial use of this corpus is permitted. We ask only that you cite the WMT24 Chat Translation Task overview paper. Any other use is not permitted without prior written authorization from Unbabel.