/flores

Facebook Low Resource (FLoRes) MT Benchmark

Primary LanguagePythonCreative Commons Attribution Share Alike 4.0 InternationalCC-BY-SA-4.0


The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages.

Looking for FLORESv1, which included Nepali, Sinhala, Pashto, and Khmer? Click here

Abstract

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated in 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.

Download FLORES-101 Dataset

The data can be downloaded from: Here.

Evaluation

SPM-BLEU

For evaluation, we use SentencePiece BLEU (spBLEU) which uses a SentencePiece (SPM) tokenizer with 256K tokens and then BLEU score is computed on the sentence-piece tokenized text. This requires installing sacrebleu using a specific branch:

git clone --single-branch --branch adding_spm_tokenized_bleu https://github.com/ngoyal2707/sacrebleu.git
cd sacrebleu
python setup.py install

Offline Evaluation

Download FLORES-101 dev and devtest dataset

cd ~/
wget https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz
tar -xvzf flores101_dataset.tar.gz

Compute spBLEU

Instructions for computing spBLEU for detokenized translations generated by a model

flores101_devtest=flores101_dataset/devtest

# Path to generated detokenized translations file
translation_file=/path/to/detok_trans.txt

# Set the target language (for this example, English)
trg_lang=eng

cat $translation_file | sacrebleu -tok spm $flores101_devtest/${trg_lang}.devtest

Example walkthrough of Generation and Evaluation using a pre-trained model in fairseq

Following example walks shows evaluating released M2M-124 615M model on an example language pair of Nyanja -> Swahili on FLORES-101 devtest which achieves 12.4 spBLEU.

Download model, sentencepiece vocab

fairseq=/path/to/fairseq
cd $fairseq

# Download 615M param model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz

# Extract 
tar -xvzf flores101_mm100_615M.tar.gz

Encode using our SentencePiece Model

Note: Install SentencePiece from here

flores101_dataset=/path/to/flores_dataset
fairseq=/path/to/fairseq
cd $fairseq

# Example lang pair translation: Nyanja -> Swahili
# MM100 code for Nyanja and Swahili: ny, sw

SRC_LANG_CODE=nya
TRG_LANG_CODE=swh

SRC_MM100_LANG_CODE=ny
TRG_MM100_LANG_CODE=sw

python scripts/spm_encode.py \
    --model flores101_mm100_615M/sentencepiece.bpe.model \
    --output_format=piece \
    --inputs=$flores101_dataset/devtest/${SRC_LANG_CODE}.devtest \
    --outputs=spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE}.${SRC_MM100_LANG_CODE}

python scripts/spm_encode.py \
    --model flores101_mm100_615M/sentencepiece.bpe.model \
    --output_format=piece \
    --inputs=$flores101_dataset/devtest/${TRG_LANG_CODE}.devtest \
    --outputs=spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE}.${TRG_MM100_LANG_CODE}

Binarization

fairseq-preprocess \
    --source-lang ${SRC_MM100_LANG_CODE} --target-lang ${TRG_MM100_LANG_CODE} \
    --testpref spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE} \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE} \
    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt

Generation

fairseq-generate \
    data_bin_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE} \
    --batch-size 1 \
    --path flores101_mm100_615M/model.pt \
    --fixed-dictionary flores101_mm100_615M/dict.txt \
    -s ${SRC_MM100_LANG_CODE} -t ${TRG_MM100_LANG_CODE} \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs flores101_mm100_615M/language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --results-path generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}

# clean fairseq generated file to only create hypotheses file.
cat generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}/generate-test.txt  | grep -P '^H-'  | cut -c 3- | sort -n -k 1 | awk -F "\t" '{print $NF}' > generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}/sys.txt

spBLEU Evaluation

# Get score
sacrebleu flores101_dataset/devtest/${TRG_LANG_CODE}.devtest < generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}/sys.txt --tokenize spm
# Expected Outcome:
# BLEU+case.mixed+numrefs.1+smooth.exp+tok.spm+version.1.5.0 = 12.4 34.9/15.8/8.7/4.9 (BP = 1.000 ratio = 1.007 hyp_len = 37247 ref_len = 36999)

List of Languages

Language FLORES-101 code MM100 lang code
Akrikaans afr af
Amharic amh am
Arabic ara ar
Armenian hye hy
Assamese asm as
Asturian ast ast
Azerbaijani azj az
Belarusian bel be
Bengali ben bn
Bosnian bos bs
Bulgarian bul bg
Burmese mya my
Catalan cat ca
Cebuano ceb ceb
Chinese Simpl zho_simpl zho
Chinese Trad zho_trad zho
Croatian hrv hr
Czech ces cs
Danish dan da
Dutch nld nl
English eng en
Estonian est et
Filipino (Tagalog) tgl tl
Finnish fin fi
French fra fr
Fulah ful ff
Galician glg gl
Ganda lug lg
Georgian kat ka
German deu de
Greek ell el
Gujarati guj gu
Hausa hau ha
Hebrew heb he
Hindi hin hi
Hungarian hun hu
Icelandic isl is
Igbo ibo ig
Indonesian ind id
Irish gle ga
Italian ita it
Japanese jpn ja
Javanese jav jv
Kabuverdianu kea kea
Kamba kam kam
Kannada kan kn
Kazakh kaz kk
Khmer khm km
Korean kor ko
Kyrgyz kir ky
Lao lao lo
Latvian lav lv
Lingala lin ln
Lithuanian lit lt
Luo luo luo
Luxembourgish ltz lb
Macedonian mkd mk
Malay msa ms
Malayalam mal ml
Maltese mlt mt
Maori mri mi
Marathi mar mr
Mongolian mon mn
Nepali npi ne
Northern Sotho nso ns
Norwegian nob no
Nyanja nya ny
Occitan oci oc
Oriya ory or
Oromo orm om
Pashto pus ps
Persian fas fa
Polish pol pl
Portuguese (Brazil) por pt
Punjabi pan pa
Romanian ron ro
Russian rus ru
Serbian srp sr
Shona sna sn
Sindhi snd sd
Slovak slk sk
Slovenian slv sl
Somali som so
Sorani Kurdish ckb ku
Spanish (Latin American) spa es
Swahili swh sw
Swedish swe sv
Tajik tgk tg
Tamil tam ta
Telugu tel te
Thai tha th
Turkish tur tr
Ukrainian ukr uk
Umbundu umb umb
Urdu urd ur
Uzbek uzb uz
Vietnamese vie vi
Welsh cym cy
Wolof wol wo
Xhosa xho xh
Yoruba yor yo
Zulu zul zu

WMT Task

The FLORES-101 dataset is being used for the WMT2021 Large-Scale Multilingual Machine Translation Shared Task. You can learn more about the task HERE. We also provide two pretrained models, downloadable from the WMT task page.

Citation

If you use this data in your work, please cite:

@inproceedings{,
  title={The FLORES-101  Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
  author={Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm\'{a}n, Francisco and Fan, Angela},
  year={2021}
}

@inproceedings{,
  title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
  author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1902.01382},
  year={2019}
}

Changelog

  • 2021-06-04: Released FLORES-101

License

The dataset is licenced under CC-BY-SA, see the LICENSE file for details.