We are excited to introduce MTet (Multi-domain Translation for English and VieTnamese), a new, larger, and higher-quality machine translation dataset. This release extends our previous dataset (v1.0) with more high-quality English-Vietnamese sentence pairs across various domains. We also show that our new, larger Transformer models achieve state-of-the-art results on multiple test sets.
Get data and model at Google Cloud Storage
Visit our blog post for more details.
This code is built on top of vietai/dab:
To prepare for training, generate tfrecords from raw text files:
python t2t_datagen.py \
--data_dir=$path_to_folder_contains_vocab_file \
--tmp_dir=$path_to_folder_that_contains_training_data \
--problem=$problem
To train a Transformer model on the generated tfrecords:
python t2t_trainer.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_to_save_checkpoints
To run inference on the trained model:
python t2t_decoder.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_contains_checkpoints \
--checkpoint_path=$path_to_checkpoint
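
The three commands above share most of their flags, so it can help to assemble them from one set of values. The following is a minimal sketch, not part of this repository: `build_t2t_commands` and every path, problem name, and hparams name below are hypothetical placeholders, and it assumes the vocab file and the generated tfrecords live in the same `data_dir`. It only prints the commands; it does not run them.

```python
import shlex


def build_t2t_commands(data_dir, tmp_dir, output_dir, problem, hparams_set, checkpoint_path):
    """Return the datagen, train, and decode commands as argv lists.

    Hypothetical helper: it only reuses the flags shown in the commands above.
    """
    shared = [f"--data_dir={data_dir}", f"--problem={problem}"]
    model_flags = [
        f"--hparams_set={hparams_set}",
        "--model=transformer",
        f"--output_dir={output_dir}",
    ]
    return {
        "datagen": [
            "python", "t2t_datagen.py",
            f"--data_dir={data_dir}",
            f"--tmp_dir={tmp_dir}",
            f"--problem={problem}",
        ],
        "train": ["python", "t2t_trainer.py", *shared, *model_flags],
        "decode": [
            "python", "t2t_decoder.py", *shared, *model_flags,
            f"--checkpoint_path={checkpoint_path}",
        ],
    }


# Placeholder values (assumptions): replace with your own paths and names.
commands = build_t2t_commands(
    data_dir="/path/to/data_dir",
    tmp_dir="/path/to/raw_text",
    output_dir="/path/to/checkpoints",
    problem="my_problem",
    hparams_set="my_hparams_set",
    checkpoint_path="/path/to/checkpoints/my_checkpoint",
)

for phase, argv in commands.items():
    # shlex.join (Python 3.8+) quotes arguments so the line can be pasted into a shell.
    print(f"# {phase}")
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv list can be passed to `subprocess.run` once the placeholder values are replaced.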
In this colab, we demonstrate how to run these three phases with the data and model hosted on Google Cloud Storage.
Our data contains roughly 4.2 million pairs of texts across multiple domains, such as medical publications, religious texts, engineering articles, literature, news, and poems. A more detailed breakdown of our data is shown in the table below.
Domain | v1 | v2 (MTet) |
---|---|---|
Fictional Books | 333,189 | 473,306 |
Legal Document | 1,150,266 | 1,134,813 |
Medical Publication | 5,861 | 13,410 |
Movies Subtitles | 250,000 | 721,174 |
Software | 79,912 | 79,132 |
TED Talk | 352,652 | 303,131 |
Wikipedia | 645,326 | 1,094,248 |
News | 18,449 | 18,389 |
Religious texts | 124,389 | 48,927 |
Educational content | 397,008 | 213,284 |
No tag | 5,517 | 63,896 |
Total | 3,362,569 | 4,163,710 |
The data sources are described in more detail here.
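
To see how each domain changed between v1 and v2, the counts from the table above can be compared directly. This is a small illustrative sketch: the numbers are copied from the table, while the `counts` dictionary and variable names are our own.

```python
# Sentence-pair counts per domain, copied from the table: (v1, v2).
counts = {
    "Fictional Books": (333_189, 473_306),
    "Legal Document": (1_150_266, 1_134_813),
    "Medical Publication": (5_861, 13_410),
    "Movies Subtitles": (250_000, 721_174),
    "Software": (79_912, 79_132),
    "TED Talk": (352_652, 303_131),
    "Wikipedia": (645_326, 1_094_248),
    "News": (18_449, 18_389),
    "Religious texts": (124_389, 48_927),
    "Educational content": (397_008, 213_284),
    "No tag": (5_517, 63_896),
}

v1_total = sum(v1 for v1, _ in counts.values())
v2_total = sum(v2 for _, v2 in counts.values())

# Sort domains by absolute change, largest increase first.
by_change = sorted(counts.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True)

for domain, (v1, v2) in by_change:
    print(f"{domain:<22}{v2 - v1:>+12,}")

print(f"{'Total':<22}{v2_total - v1_total:>+12,}")
```

The computed totals match the Total row of the table (3,362,569 for v1 and 4,163,710 for v2).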
We would like to thank Google for the support of Cloud credits and TPU quota!