Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information, EMNLP2020

This is the repository for the EMNLP 2020 paper "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information".

[paper]

News

mRASP has evolved into mRASP2/mCOLT, a much stronger many-to-many multilingual model. mRASP2 has been accepted to the ACL 2021 main conference. You are welcome to use mRASP2.

[paper]
[code]

Introduction

mRASP, short for multilingual Random Aligned Substitution Pre-training, is a pre-trained multilingual neural machine translation model. mRASP is pre-trained on a large-scale multilingual corpus containing 32 language pairs. The resulting model can be further fine-tuned on downstream language pairs. To bring words and phrases with similar meanings closer together in representation space across languages, we introduce the Random Aligned Substitution (RAS) technique. Extensive experiments in different scenarios demonstrate the efficacy of mRASP. For details, please refer to the paper.
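
As a rough illustration of the RAS idea, the following Python sketch randomly replaces words in a sentence with translations taken from a bilingual dictionary. The function name, the toy dictionary, and the replace_prob value are assumptions made for illustration only; the repository's own implementation is in preprocess/tools/ras/replace_word.py and may differ.

```python
import random


def random_aligned_substitution(tokens, bilingual_dict, replace_prob=0.3, rng=None):
    """Return a copy of `tokens` in which some words are swapped for a translation.

    `bilingual_dict` maps a word to a list of candidate translations.
    `replace_prob` is an illustrative value, not one taken from this README.
    """
    rng = rng or random.Random()
    substituted = []
    for token in tokens:
        candidates = bilingual_dict.get(token)
        # Only words with a dictionary entry can be replaced.
        if candidates and rng.random() < replace_prob:
            substituted.append(rng.choice(candidates))
        else:
            substituted.append(token)
    return substituted


# Toy English-to-French dictionary (hypothetical entries).
toy_dict = {"love": ["aime"], "singing": ["chanter"], "dancing": ["danser"]}
source = "I love singing and dancing".split()

# A fixed seed makes the sketch reproducible between runs.
mixed = random_aligned_substitution(source, toy_dict, replace_prob=0.5, rng=random.Random(0))
print(" ".join(mixed))
```

The output has the same number of tokens as the input, so the code-switched sentence can be paired with the original target sentence without any re-alignment.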

Structure

.
├── experiments                             # Example files: including configs and data
├── preprocess                              # The preprocess step
│   ├── tools/
│   │   ├── __init__.py
│   │   ├── common.sh           
│   │   ├── data_preprocess/                # clean + tokenize
│   │   │   ├── __init__.py
│   │   │   ├── clean_scripts/
│   │   │   ├── tokenize_scripts/
│   │   │   ├── clean_each.sh
│   │   │   ├── prep_each.sh
│   │   │   ├── prep_mono.sh                # preprocess a monolingual corpus
│   │   │   ├── prep_parallel.sh            # preprocess a parallel corpus
│   │   │   └── tokenize_each.sh
│   │   ├── misc/
│   │   │   ├── __init__.py
│   │   │   ├── multilingual_preprocess_yml_generator.py
│   │   │   └── multiprocess.sh
│   │   ├── ras/
│   │   │   ├── __init__.py
│   │   │   ├── random_alignment_substitution.sh
│   │   │   ├── random_alignment_substitution_w_multi.sh 
│   │   │   ├── replace_word.py  # RAS using MUSE bilingual dict
│   │   │   └── replace_word_w_multi.py  # RAS using multi-way parallel dict
│   │   └── subword/
│   │       ├── __init__.py
│   │       ├── multilingual_apply_subword_vocab.sh     # script to only apply subword (w/o learning new vocab)
│   │       ├── multilingual_learn_apply_subword_vocab_joint.sh     # script to learn new vocab and apply subword
│   │       └── scripts/
│   ├── __init__.py
│   ├── multilingual_merge.sh               # script to merge multiple parallel dataset
│   ├── multilingual_preprocess_main.sh     # main entry for preprocess
│   └── README.md    
├── train                        
│   ├── __init__.py
│   ├── misc/
│   │   ├── load_config.sh
│   │   └── monitor.sh                  # script to monitor the generation of checkpoint and evaluate them
│   ├── scripts/
│   │   ├── __init__.py
│   │   ├── average_checkpoints_from_file.py
│   │   ├── average_ckpt.sh             # checkpoint average
│   │   ├── common_scripts.sh
│   │   ├── get_worst_ckpt.py
│   │   ├── keep_top_ckpt.py
│   │   ├── remove_bpe.py
│   │   └── rerank_utils.py
│   ├── pre-train.sh                    # main entry for pre-train
│   ├── fine-tune.sh                    # main entry for fine-tune
│   └── README.md
├── requirements.txt
└── README.md

Pre-requisite

pip install -r requirements.txt

Pipeline

The pipeline contains two steps: Pre-train and Fine-tune. We first pre-train our model on multiple language pairs jointly. Then we further fine-tune on downstream language pairs.

Preprocess

The preprocess pipeline is composed of the following 4 separate steps:

Data filtering and cleaning
Tokenization
Learn / apply a joint BPE subword vocabulary
Random Aligned Substitution (optional; applies only to the training set)
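
To make the first step more concrete, here is a minimal Python sketch of one common way to filter a parallel corpus. The function name, the thresholds, and the filtering rules themselves are assumptions; the actual rules used by this repository are defined in preprocess/tools/data_preprocess/clean_scripts/ and may be different.

```python
def clean_parallel(pairs, max_len=250, max_ratio=3.0):
    """Keep only sentence pairs that pass simple length-based checks.

    `max_len` and `max_ratio` are hypothetical thresholds chosen for illustration.
    """
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is empty.
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop pairs where either side is very long.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ too much, a frequent sign of misalignment.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


sample = [
    ("hello world", "bonjour le monde"),
    ("", "vide"),
    ("a", "b c d e f"),
]
print(clean_parallel(sample))
```

With the sample above, only the first pair survives: the second has an empty source side and the third exceeds the length ratio.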

We provide a script to run all the above steps in one command:

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${config_yaml_file}

Pre-train

step1: preprocess train data and learn a joint BPE subword vocabulary across all languages.

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train.yml

The command above runs the clean, subword, merge, and RAS steps in order. The result is a BPE vocabulary and a RAS-processed multilingual dataset merged from multiple language pairs.

step2: preprocess development data

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/dev.yml

We create a multilingual development set to help choose the best pre-trained checkpoint.

step3: binarize data

bash ${PROJECT_ROOT}/experiments/example/bin_pretrain.sh

step4: pre-train on RASed multilingual corpus

export CUDA_VISIBLE_DEVICES=0,1,2,3 && bash ${PROJECT_ROOT}/train/pre-train.sh ${PROJECT_ROOT}/experiments/example/configs/train/pre-train/transformer_big.yml

You can modify the configs to choose the model architecture or dataset used.

Fine-tune

step1: preprocess train/test data

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train_en2de.yml
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/test_en2de.yml

The commands above run the clean and subword steps.

step2: binarize data

bash ${PROJECT_ROOT}/experiments/example/bin_finetune.sh

step3: fine-tune on specific language pairs

export CUDA_VISIBLE_DEVICES=0,1,2 && export EVAL_GPU_INDEX=${eval_gpu_index} && bash ${PROJECT_ROOT}/train/fine-tune.sh ${PROJECT_ROOT}/experiments/example/configs/train/fine-tune/en2de_transformer_big.yml ${PROJECT_ROOT}/experiments/example/configs/eval/en2de_eval.yml

eval_gpu_index is the index of the GPU on your machine that will be used to evaluate the model. If you set it to -1, the CPU is used for evaluation during training.
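
If you prefer to launch fine-tuning from Python, the command can be assembled as in the sketch below. The helper name build_finetune_command and the example project root are hypothetical; the environment variables, script, and config paths are the ones from the fine-tuning command shown in this section.

```python
import os


def build_finetune_command(project_root, train_gpus, eval_gpu_index):
    """Build the argument list and environment for train/fine-tune.sh.

    `eval_gpu_index=-1` requests CPU evaluation, as described in this README.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in train_gpus)
    env["EVAL_GPU_INDEX"] = str(eval_gpu_index)

    configs = f"{project_root}/experiments/example/configs"
    cmd = [
        "bash",
        f"{project_root}/train/fine-tune.sh",
        f"{configs}/train/fine-tune/en2de_transformer_big.yml",
        f"{configs}/eval/en2de_eval.yml",
    ]
    return cmd, env


# "/opt/mRASP" is a placeholder for your own checkout location.
cmd, env = build_finetune_command("/opt/mRASP", train_gpus=[0, 1, 2], eval_gpu_index=-1)
print(" ".join(cmd))

# To actually start training, pass both values to subprocess:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```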

Multilingual Pre-trained Model

Dataset

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original corpus of 32 language pairs contains about 197M sentence pairs. After applying RAS we obtain about 262M sentence pairs, because we keep both the original sentences and the substituted sentences. We release both the original dataset and the dataset after applying RAS. (If you cannot download the files, replace the download link prefix "sf3-ttcdn-tos.pstatp.com" with "lf3-nlp-opensource.bytetos.com".)

Dataset	#Pair
32-lang-pairs-TRAIN	197603294
32-lang-pairs-RAS-TRAIN	262662792
32-lang-pairs-DEV	156587
Vocab	-
BPE Code	-
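
The figures above can be checked with a few lines of Python. The variable names are illustrative; the counts are copied from the table.

```python
n_language_pairs = 32
n_directions = 2 * n_language_pairs  # each pair is used in both directions

original_pairs = 197_603_294  # 32-lang-pairs-TRAIN
ras_pairs = 262_662_792       # 32-lang-pairs-RAS-TRAIN

# Sentence pairs added by keeping the substituted copies next to the originals.
extra_pairs = ras_pairs - original_pairs
growth = ras_pairs / original_pairs

print(n_directions)        # 64
print(extra_pairs)         # 65059498
print(f"{growth:.3f}")     # 1.329
```

In other words, RAS adds roughly 65M sentence pairs, making the training set about a third larger than the original corpus.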

Checkpoints

We release checkpoints trained on 32-lang-pairs and 32-lang-pairs-RAS. We also extend our model to 58 language pairs.

Dataset	Checkpoint
Baseline-w/o-RAS	mTransformer-6enc6dec
mRASP-PC32	mRASP-PC32-6enc6dec
mRASP-PC58	-

Fine-tuning Model

We release benchmark checkpoints and the corresponding configs for En-Ro (both directions), En2De, and En2Fr.

Lang-Pair	Datasource	Checkpoints	Configs	tok-BLEU	detok-BLEU
En2Ro	WMT16 En-Ro, dev, test	en2ro	en2ro_config	39.0	37.6
Ro2En	WMT16 Ro-En, dev, test	ro2en	ro2en_config	37.7	36.9
En2De	WMT16 En-De, newstest16	en2de	en2de_config	30.3	-
En2Fr	WMT14 En-Fr, newstest14	en2fr	en2fr_config	44.3	-

Comparison with mBART

mBART is a model pre-trained on large-scale multilingual corpora. To illustrate how mRASP compares, we report results against mBART on language pairs with different amounts of training data, using the same test sets as mBART.

Lang-pairs	Size	Direction	Datasource	Testset	Checkpoint	mBART	mRASP
En-Gu	10K	⟶	en_gu_train	newstest19	en2gu	0.1	3.2
		⟵	en_gu_train	newstest19	gu2en	0.3	0.6
En-Kk	128K	⟶	en_kk_train	newstest19	en2kk	2.5	8.2
		⟵	en_kk_train	newstest19	kk2en	7.4	12.3
En-Tr	388K	⟶	en_tr_train	newstest17	en2tr	17.8	20.0
		⟵	en_tr_train	newstest17	tr2en	22.5	23.4
En-Et	2.3M	⟶	en_et_train	newstest18	en2et	21.4	20.9
		⟵	en_et_train	newstest18	et2en	27.8	26.8
En-Fi	4M	⟶	en_fi_train	newstest17	en2fi	22.4	24.0
		⟵	en_fi_train	newstest17	fi2en	28.5	28.0
En-Lv	5.5M	⟶	en_lv_train	newstest17	en2lv	15.9	21.6
		⟵	en_lv_train	newstest17	lv2en	19.3	24.4
En-Cs	978K	⟶	en_cs_train	newstest19	en2cs	18.0	19.9
En-De	4.5M	⟶	en_de_train	newstest19	en2de	30.5	35.2
En-Fr	40M	⟶	en_fr_train	newstest14	en2fr	41.0	44.3
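
For a quick summary of the table, the Python sketch below counts the directions in which mRASP scores higher and computes the unweighted mean score difference. The scores are copied from the table; the summary statistics are a convenience computed here and are not metrics reported in the paper.

```python
# (checkpoint, mBART score, mRASP score), one entry per table row
rows = [
    ("en2gu", 0.1, 3.2), ("gu2en", 0.3, 0.6),
    ("en2kk", 2.5, 8.2), ("kk2en", 7.4, 12.3),
    ("en2tr", 17.8, 20.0), ("tr2en", 22.5, 23.4),
    ("en2et", 21.4, 20.9), ("et2en", 27.8, 26.8),
    ("en2fi", 22.4, 24.0), ("fi2en", 28.5, 28.0),
    ("en2lv", 15.9, 21.6), ("lv2en", 19.3, 24.4),
    ("en2cs", 18.0, 19.9), ("en2de", 30.5, 35.2),
    ("en2fr", 41.0, 44.3),
]

wins = sum(1 for _, mbart, mrasp in rows if mrasp > mbart)
mean_gain = sum(mrasp - mbart for _, mbart, mrasp in rows) / len(rows)
mbart_better = [name for name, mbart, mrasp in rows if mbart > mrasp]

print(f"mRASP higher in {wins} of {len(rows)} directions")
print(f"mean difference (mRASP - mBART): {mean_gain:+.2f}")
print("mBART higher on:", ", ".join(mbart_better))
```

On these numbers mRASP is ahead in 12 of 15 directions, with mBART ahead on en2et, et2en, and fi2en.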

Citation

If you are interested in mRASP, please consider citing our paper:

@inproceedings{lin-etal-2020-pre,
    title = "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information",
    author = "Lin, Zehui  and
      Pan, Xiao  and
      Wang, Mingxuan  and
      Qiu, Xipeng  and
      Feng, Jiangtao  and
      Zhou, Hao  and
      Li, Lei",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.210",
    pages = "2649--2663",
}
