Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained following a recipe similar to T5's. This repo can be used to reproduce the experiments in the mT5 paper.
mT5 is pretrained on the mC4 corpus, covering 101 languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.
mT5 achieves state-of-the-art performance on many cross-lingual NLP tasks, as of November 2020. For example, on XTREME zero-shot classification, structured prediction, and QA tasks (accuracy for the classification tasks, F1 for structured prediction and QA):
Model | XNLI | PAWS-X | WikiAnn-NER | XQuAD | MLQA | TyDiQA-GoldP |
---|---|---|---|---|---|---|
mBERT | 65.4 | 81.9 | 62.2 | 64.5 | 61.4 | 59.7 |
XLM | 69.1 | 80.9 | 61.2 | 59.8 | 48.5 | 43.6 |
InfoXLM | 81.4 | - | - | - | 73.6 | - |
X-STILTs | 80.4 | 87.7 | 64.7 | 77.2 | 72.3 | 76.0 |
XLM-R | 79.2 | 86.4 | 65.4 | 76.6 | 71.6 | 65.1 |
VECO | 79.9 | 88.7 | 65.7 | 77.3 | 71.7 | 67.6 |
RemBERT | 80.8 | 87.5 | 70.1 | 79.6 | 73.1 | 77.0 |
mT5-Small | 67.5 | 82.4 | 50.5 | 58.1 | 54.6 | 35.2 |
mT5-Base | 75.4 | 86.4 | 55.7 | 67.0 | 64.6 | 57.2 |
mT5-Large | 81.1 | 88.9 | 58.5 | 77.8 | 71.2 | 69.9 |
mT5-XL | 82.9 | 89.6 | 65.5 | 79.5 | 73.5 | 75.9 |
mT5-XXL | 85.0 | 90.0 | 69.2 | 82.5 | 76.0 | 80.8 |
To run this code, you need to install the `t5` library. General instructions for training, fine-tuning, evaluation, and exporting models for inference can be found in the `t5` repo. To use the additional mT5 tasks provided in this library with the `t5_mesh_transformer` command, run from this directory and add the flag `--module_import="multilingual_t5.tasks"`.

There is also support for mT5 in HuggingFace; see the instructions in the T5 repo.
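As a quick start, here is a minimal sketch of loading one of the HuggingFace-ported checkpoints (assuming the `transformers` library is installed; the `google/mt5-small` hub name and the example input are illustrative):

```python
# Minimal sketch: load a ported mT5 checkpoint through HuggingFace Transformers.
# Assumes `pip install transformers sentencepiece`; the hub name below is the
# commonly used `google/mt5-small` identifier.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# mT5 is pretrained only with span corruption, so generations are not useful
# until the model is fine-tuned on a downstream task; this just checks wiring.
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```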
To train an mT5-Large model on the `mc4` task from scratch as described in the paper:
export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=gs://yourbucket
export TPU=yourtpu
ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf
TASK=mc4
MODEL_DIR="${BUCKET}/${TASK}"
python -m t5.models.mesh_transformer_main \
--tpu="${TPU}" \
--gcp_project="${PROJECT}" \
--tpu_zone="${ZONE}" \
--model_dir="${MODEL_DIR}" \
--gin_file="models/t5.1.1.large.gin" \
--gin_param="MIXTURE_NAME = '${TASK}'" \
--gin_param="utils.run.sequence_length = {'inputs': 1024, 'targets': 256}" \
--gin_param="utils.run.batch_size = ('tokens_per_batch', 1048576)" \
--gin_param="utils.run.learning_rate_schedule=@learning_rate_schedules.rsqrt_no_ramp_down" \
--gin_param="run.train_steps = 1000000" \
--gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
--eval_mode="perplexity_eval" \
--eval_gin_param="mesh_eval_dataset_fn.num_eval_examples = 10000" \
--t5_tfds_data_dir="${BUCKET}/t5-tfds" \
--module_import="multilingual_t5.tasks"
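The `--module_import="multilingual_t5.tasks"` flag works because importing that module registers the mC4 and downstream tasks with the `t5` data registries. Below is a rough sketch of inspecting the registered `mc4` task from Python (assuming the `t5` library is installed and the mC4 data is reachable through TFDS; downloading mC4 locally is very large, so this is illustrative only):

```python
# Sketch: look at the `mc4` task/mixture that `--module_import` registers.
# Assumes the t5 library is installed and mC4 is available through TFDS.
import t5
import multilingual_t5.tasks  # noqa: F401 -- side effect: registers the mT5 tasks

mc4 = t5.data.get_mixture_or_task("mc4")
ds = mc4.get_dataset(
    sequence_length={"inputs": 1024, "targets": 256},  # matches the gin param above
    split="train",
    shuffle=False,
)
for example in ds.take(1):
    print({k: v.numpy()[:10] for k, v in example.items()})
```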
As an example, to fine-tune the mT5-Large model on the `xnli_zeroshot` task:
export PROJECT=yourproject
export ZONE=yourzone
export BUCKET=gs://yourbucket
export TPU=yourtpu
ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-256 --tpu-only --noconf
TASK=xnli_zeroshot
SEQUENCE_LENGTH_GIN=xnli
PRETRAINED_DIR=gs://t5-data/pretrained_models/mt5/large
PRETRAINED_STEPS=1000000
FINETUNE_STEPS=20000
MODEL_DIR="${BUCKET}/${TASK}"
# Run fine-tuning
python -m t5.models.mesh_transformer_main \
--tpu="${TPU}" \
--gcp_project="${PROJECT}" \
--tpu_zone="${ZONE}" \
--model_dir="${MODEL_DIR}" \
--gin_file="${PRETRAINED_DIR}/operative_config.gin" \
--gin_file="sequence_lengths/${SEQUENCE_LENGTH_GIN}.gin" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = 'v3-256'" \
--gin_param="MIXTURE_NAME = '${TASK}'" \
--gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
--gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
--t5_tfds_data_dir="${BUCKET}/t5-tfds" \
--module_import="multilingual_t5.tasks" \
--gin_location_prefix="multilingual_t5/gin/"
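The same fine-tuning can also be driven from Python with the `t5.models.MtfModel` API instead of the `t5_mesh_transformer` command. Here is a rough sketch under the same setup (the batch size and sequence lengths below are placeholders, not the values from `sequence_lengths/xnli.gin`, and the output bucket path is hypothetical):

```python
# Sketch: fine-tune mT5-Large on xnli_zeroshot through the t5.models.MtfModel API.
# Batch size and sequence lengths are illustrative placeholders, not the values
# from sequence_lengths/xnli.gin; the output bucket path is hypothetical.
import t5
import multilingual_t5.tasks  # registers the xnli_zeroshot task

PRETRAINED_DIR = "gs://t5-data/pretrained_models/mt5/large"
MODEL_DIR = "gs://yourbucket/xnli_zeroshot"

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu="yourtpu",                     # TPU name or grpc:// address
    tpu_topology="v3-256",
    model_parallelism=1,
    batch_size=128,                                   # placeholder
    sequence_length={"inputs": 512, "targets": 16},   # placeholder
    learning_rate_schedule=0.001,      # constant LR commonly used for T5 fine-tuning
    save_checkpoints_steps=5000,
    iterations_per_loop=100,
)

model.finetune(
    mixture_or_task_name="xnli_zeroshot",
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=20000,
)
```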
The tasks for the remaining experiments are defined in the `tasks.py` file.
We have released the following checkpoints for pre-trained models described in our paper:
- mT5-Small (300 million parameters): gs://t5-data/pretrained_models/mt5/small
- mT5-Base (600 million parameters): gs://t5-data/pretrained_models/mt5/base
- mT5-Large (1.2 billion parameters): gs://t5-data/pretrained_models/mt5/large
- mT5-XL (4 billion parameters): gs://t5-data/pretrained_models/mt5/xl
- mT5-XXL (13 billion parameters): gs://t5-data/pretrained_models/mt5/xxl
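To sanity-check one of these checkpoints, you can list its directory and read the variable shapes with TensorFlow. This is a sketch; it assumes TensorFlow with GCS filesystem support and that the directory contains a `checkpoint` state file (otherwise point at a `model.ckpt-*` prefix directly):

```python
# Sketch: inspect a released mT5 checkpoint in the public GCS bucket.
# Assumes TensorFlow with GCS filesystem support.
import tensorflow as tf

ckpt_dir = "gs://t5-data/pretrained_models/mt5/small"
print(tf.io.gfile.listdir(ckpt_dir))

# Resolve the checkpoint prefix and print a few variable shapes.
ckpt_prefix = tf.train.latest_checkpoint(ckpt_dir)
reader = tf.train.load_checkpoint(ckpt_prefix)
shapes = reader.get_variable_to_shape_map()
print(len(shapes), "variables")
for name in sorted(shapes)[:5]:
    print(name, shapes[name])
```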
If you extend or use this work, please cite the paper where it was introduced:
@misc{xue2020mt5,
  title = {{mT5}: A massively multilingual pre-trained text-to-text transformer},
  author = {Linting Xue and Noah Constant and Adam Roberts and Mihir Kale and Rami Al-Rfou and Aditya Siddhant and Aditya Barua and Colin Raffel},
  year = {2020},
  eprint = {2010.11934},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}