FLORES-101 is a many-to-many multilingual translation benchmark dataset for 101 languages.
- Paper: The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
- Download: FLORES-101 dataset
- Evaluation server: Dynabench (instructions to submit a model)
Looking for FLORESv1, which included Nepali, Sinhala, Pashto, and Khmer? Click here.
One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
The data can be downloaded here.
For evaluation, we use SentencePiece BLEU (spBLEU): text is tokenized with a SentencePiece (SPM) tokenizer with a 256K-token vocabulary, and the BLEU score is then computed on the sentence-piece tokenized text. This requires installing sacrebleu from a specific branch:
git clone --single-branch --branch adding_spm_tokenized_bleu https://github.com/ngoyal2707/sacrebleu.git
cd sacrebleu
python setup.py install
cd ~/
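As a quick sanity check (a minimal sketch; the exact version string and help text may differ), you can verify that the patched build is the one on your path and exposes the spm tokenizer:
# Print the installed sacrebleu version.
python -c "import sacrebleu; print(sacrebleu.__version__)"
# In the patched branch, 'spm' should appear among the --tokenize choices.
sacrebleu --help | grep spm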
wget https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz
tar -xvzf flores101_dataset.tar.gz
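Since all translations are multilingually aligned, line i of every per-language file is a translation of the same source sentence. A quick way to see this, assuming the archive unpacks to flores101_dataset/ with the devtest layout used below:
# The per-language files should all have the same number of lines.
wc -l flores101_dataset/devtest/eng.devtest flores101_dataset/devtest/swh.devtest
# View a few aligned English/Swahili sentence pairs side by side.
paste flores101_dataset/devtest/eng.devtest flores101_dataset/devtest/swh.devtest | head -3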
Instructions for computing spBLEU for detokenized translations generated by a model
# Path to the devtest split of the extracted dataset
flores101_devtest=flores101_dataset/devtest
# Path to generated detokenized translations file
translation_file=/path/to/detok_trans.txt
# Set the target language (for this example, English)
trg_lang=eng
cat $translation_file | sacrebleu -tok spm $flores101_devtest/${trg_lang}.devtest
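If you have hypotheses for several target languages, the same command loops naturally; the per-language hypothesis filenames below are hypothetical:
# Hypothetical layout: one detokenized hypothesis file per target language.
for trg_lang in swh nya eng; do
  sacrebleu -tok spm $flores101_devtest/${trg_lang}.devtest < /path/to/detok_trans.${trg_lang}.txt
done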
The following example shows how to evaluate the released M2M-124 615M model on the example language pair Nyanja -> Swahili on the FLORES-101 devtest, where it achieves 12.4 spBLEU.
fairseq=/path/to/fairseq
cd $fairseq
# Download 615M param model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
# Extract
tar -xvzf flores101_mm100_615M.tar.gz
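The extracted directory should contain at least the files the commands below rely on (this list is inferred from the walkthrough, not an exhaustive archive manifest):
ls flores101_mm100_615M
# expected, at least: model.pt  dict.txt  sentencepiece.bpe.model  language_pairs.txt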
Note: install SentencePiece from here.
flores101_dataset=/path/to/flores_dataset
fairseq=/path/to/fairseq
cd $fairseq
# Example lang pair translation: Nyanja -> Swahili
# FLORES-101 codes for Nyanja and Swahili: nya, swh
# MM100 codes for Nyanja and Swahili: ny, sw
SRC_LANG_CODE=nya
TRG_LANG_CODE=swh
SRC_MM100_LANG_CODE=ny
TRG_MM100_LANG_CODE=sw
python scripts/spm_encode.py \
--model flores101_mm100_615M/sentencepiece.bpe.model \
--output_format=piece \
--inputs=$flores101_dataset/devtest/${SRC_LANG_CODE}.devtest \
--outputs=spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE}.${SRC_MM100_LANG_CODE}
python scripts/spm_encode.py \
--model flores101_mm100_615M/sentencepiece.bpe.model \
--output_format=piece \
--inputs=$flores101_dataset/devtest/${TRG_LANG_CODE}.devtest \
--outputs=spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE}.${TRG_MM100_LANG_CODE}
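A quick way to confirm the encoding worked (a sketch; the exact pieces depend on the SPM model): SentencePiece output is whitespace-separated subword pieces, with word-initial pieces carrying the ▁ marker.
# Line count must match the raw devtest file: one encoded sentence per line.
wc -l spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE}.${SRC_MM100_LANG_CODE}
# Inspect the first encoded sentence.
head -1 spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE}.${SRC_MM100_LANG_CODE}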
fairseq-preprocess \
--source-lang ${SRC_MM100_LANG_CODE} --target-lang ${TRG_MM100_LANG_CODE} \
--testpref spm.${SRC_MM100_LANG_CODE}-${TRG_MM100_LANG_CODE} \
--thresholdsrc 0 --thresholdtgt 0 \
--destdir data_bin_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE} \
--srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
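fairseq-preprocess binarizes the test set against the model's fixed dictionary. The destination directory should look roughly like this (filenames follow the usual fairseq convention and are shown here as an assumption):
ls data_bin_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}
# typically: dict.ny.txt  dict.sw.txt  preprocess.log  test.ny-sw.ny.bin  test.ny-sw.ny.idx  test.ny-sw.sw.bin  test.ny-sw.sw.idx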
fairseq-generate \
data_bin_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE} \
--batch-size 1 \
--path flores101_mm100_615M/model.pt \
--fixed-dictionary flores101_mm100_615M/dict.txt \
-s ${SRC_MM100_LANG_CODE} -t ${TRG_MM100_LANG_CODE} \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs flores101_mm100_615M/language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--results-path generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}
# The fairseq generation log interleaves source (S-), target (T-) and hypothesis (H-) lines;
# keep only the hypotheses, restore the original sentence order, and drop the score column.
cat generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}/generate-test.txt | grep -P '^H-' | cut -c 3- | sort -n -k 1 | awk -F "\t" '{print $NF}' > generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}/sys.txt
# Get score
sacrebleu flores101_dataset/devtest/${TRG_LANG_CODE}.devtest < generation_${SRC_MM100_LANG_CODE}_${TRG_MM100_LANG_CODE}/sys.txt --tokenize spm
# Expected Outcome:
# BLEU+case.mixed+numrefs.1+smooth.exp+tok.spm+version.1.5.0 = 12.4 34.9/15.8/8.7/4.9 (BP = 1.000 ratio = 1.007 hyp_len = 37247 ref_len = 36999)
The table below lists the FLORES-101 code and the corresponding MM100 language code for each language.
Language | FLORES-101 code | MM100 lang code |
---|---|---|
Afrikaans | afr | af |
Amharic | amh | am |
Arabic | ara | ar |
Armenian | hye | hy |
Assamese | asm | as |
Asturian | ast | ast |
Azerbaijani | azj | az |
Belarusian | bel | be |
Bengali | ben | bn |
Bosnian | bos | bs |
Bulgarian | bul | bg |
Burmese | mya | my |
Catalan | cat | ca |
Cebuano | ceb | ceb |
Chinese (Simplified) | zho_simpl | zho |
Chinese (Traditional) | zho_trad | zho |
Croatian | hrv | hr |
Czech | ces | cs |
Danish | dan | da |
Dutch | nld | nl |
English | eng | en |
Estonian | est | et |
Filipino (Tagalog) | tgl | tl |
Finnish | fin | fi |
French | fra | fr |
Fulah | ful | ff |
Galician | glg | gl |
Ganda | lug | lg |
Georgian | kat | ka |
German | deu | de |
Greek | ell | el |
Gujarati | guj | gu |
Hausa | hau | ha |
Hebrew | heb | he |
Hindi | hin | hi |
Hungarian | hun | hu |
Icelandic | isl | is |
Igbo | ibo | ig |
Indonesian | ind | id |
Irish | gle | ga |
Italian | ita | it |
Japanese | jpn | ja |
Javanese | jav | jv |
Kabuverdianu | kea | kea |
Kamba | kam | kam |
Kannada | kan | kn |
Kazakh | kaz | kk |
Khmer | khm | km |
Korean | kor | ko |
Kyrgyz | kir | ky |
Lao | lao | lo |
Latvian | lav | lv |
Lingala | lin | ln |
Lithuanian | lit | lt |
Luo | luo | luo |
Luxembourgish | ltz | lb |
Macedonian | mkd | mk |
Malay | msa | ms |
Malayalam | mal | ml |
Maltese | mlt | mt |
Maori | mri | mi |
Marathi | mar | mr |
Mongolian | mon | mn |
Nepali | npi | ne |
Northern Sotho | nso | ns |
Norwegian | nob | no |
Nyanja | nya | ny |
Occitan | oci | oc |
Oriya | ory | or |
Oromo | orm | om |
Pashto | pus | ps |
Persian | fas | fa |
Polish | pol | pl |
Portuguese (Brazil) | por | pt |
Punjabi | pan | pa |
Romanian | ron | ro |
Russian | rus | ru |
Serbian | srp | sr |
Shona | sna | sn |
Sindhi | snd | sd |
Slovak | slk | sk |
Slovenian | slv | sl |
Somali | som | so |
Sorani Kurdish | ckb | ku |
Spanish (Latin American) | spa | es |
Swahili | swh | sw |
Swedish | swe | sv |
Tajik | tgk | tg |
Tamil | tam | ta |
Telugu | tel | te |
Thai | tha | th |
Turkish | tur | tr |
Ukrainian | ukr | uk |
Umbundu | umb | umb |
Urdu | urd | ur |
Uzbek | uzb | uz |
Vietnamese | vie | vi |
Welsh | cym | cy |
Wolof | wol | wo |
Xhosa | xho | xh |
Yoruba | yor | yo |
Zulu | zul | zu |
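When scripting over many language pairs, it can help to keep the FLORES-101 -> MM100 mapping in one place. A minimal bash (4+) sketch of a hypothetical lookup table, to be extended from the table above:
# Hypothetical lookup table derived from the table above; extend as needed.
declare -A MM100_CODE=( [nya]=ny [swh]=sw [eng]=en [fra]=fr )
flores_code=nya
echo "MM100 code for ${flores_code}: ${MM100_CODE[$flores_code]}"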
The FLORES-101 dataset is being used for the WMT2021 Large-Scale Multilingual Machine Translation Shared Task. You can learn more about the task here. We also provide two pretrained models, downloadable from the WMT task page.
If you use this data in your work, please cite:
@inproceedings{goyal2021flores101,
title={The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
author={Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm\'{a}n, Francisco and Fan, Angela},
year={2021}
}
@article{guzman2019flores,
title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
journal={arXiv preprint arXiv:1902.01382},
year={2019}
}
- 2021-06-04: Released FLORES-101
The dataset is licensed under CC-BY-SA; see the LICENSE file for details.