mRASP

Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information, EMNLP2020

This is the repository for the EMNLP 2020 paper "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information".

[paper]

Introduction

mRASP (multilingual Random Aligned Substitution Pre-training) is a pre-trained multilingual neural machine translation model. It is pre-trained on a large-scale multilingual corpus containing 32 language pairs, and the resulting model can be further fine-tuned on downstream language pairs. To bring words and phrases with similar meanings closer together in the representation space across languages, we introduce the Random Aligned Substitution (RAS) technique. Extensive experiments in different scenarios demonstrate the efficacy of mRASP. For details, please refer to the paper.
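As a rough illustration of the idea behind RAS, the sketch below randomly replaces words in a source sentence with translations looked up in a bilingual dictionary. It is a minimal sketch and not the repo's `replace_word.py`: the function name `random_aligned_substitution`, the toy dictionary entries, and the replacement probability `prob` are all assumptions made for illustration.

```python
import random

def random_aligned_substitution(tokens, bilingual_dict, prob=0.3, rng=None):
    """Return a copy of `tokens` with some words swapped for dictionary translations.

    Hypothetical helper: `bilingual_dict` maps a source word to a list of
    candidate translations in other languages; `prob` is an assumed rate.
    """
    rng = rng or random.Random()
    output = []
    for token in tokens:
        candidates = bilingual_dict.get(token)
        # Substitute only words that have a dictionary entry, with probability `prob`.
        if candidates and rng.random() < prob:
            output.append(rng.choice(candidates))
        else:
            output.append(token)
    return output

# Toy English -> French/German dictionary (illustrative entries only).
toy_dict = {"love": ["aime", "liebe"], "singing": ["chanter", "singen"]}
sentence = "I love singing and dancing".split()
substituted = random_aligned_substitution(
    sentence, toy_dict, prob=0.5, rng=random.Random(0)
)
print(" ".join(substituted))
```

The substituted sentence keeps the original length and word order, so it can be paired with the unchanged target sentence as an additional training example.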

Structure

.
├── experiments                             # Example files: including configs and data
├── preprocess                              # The preprocess step
│   ├── tools/
│   │   ├── __init__.py
│   │   ├── common.sh           
│   │   ├── data_preprocess/                # clean + tokenize
│   │   │   ├── __init__.py
│   │   │   ├── clean_scripts/
│   │   │   ├── tokenize_scripts/
│   │   │   ├── clean_each.sh
│   │   │   ├── prep_each.sh
│   │   │   ├── prep_mono.sh                # preprocess a monolingual corpus
│   │   │   ├── prep_parallel.sh            # preprocess a parallel corpus
│   │   │   └── tokenize_each.sh
│   │   ├── misc/
│   │   │   ├── __init__.py
│   │   │   ├── multilingual_preprocess_yml_generator.py
│   │   │   └── multiprocess.sh
│   │   ├── ras/
│   │   │   ├── __init__.py
│   │   │   ├── random_alignment_substitution.sh
│   │   │   ├── random_alignment_substitution_w_multi.sh 
│   │   │   ├── replace_word.py  # RAS using MUSE bilingual dict
│   │   │   └── replace_word_w_multi.py  # RAS using multi-way parallel dict
│   │   └── subword/
│   │       ├── __init__.py
│   │       ├── multilingual_apply_subword_vocab.sh     # script to only apply subword (w/o learning new vocab)
│   │       ├── multilingual_learn_apply_subword_vocab_joint.sh     # script to learn new vocab and apply subword
│   │       └── scripts/
│   ├── __init__.py
│   ├── multilingual_merge.sh               # script to merge multiple parallel dataset
│   ├── multilingual_preprocess_main.sh     # main entry for preprocess
│   └── README.md    
├── train                        
│   ├── __init__.py
│   ├── misc/
│   │   ├── load_config.sh
│   │   └── monitor.sh                  # script to monitor the generation of checkpoint and evaluate them
│   ├── scripts/
│   │   ├── __init__.py
│   │   ├── average_checkpoints_from_file.py
│   │   ├── average_ckpt.sh             # checkpoint average
│   │   ├── common_scripts.sh
│   │   ├── get_worst_ckpt.py
│   │   ├── keep_top_ckpt.py
│   │   ├── remove_bpe.py
│   │   └── rerank_utils.py
│   ├── pre-train.sh                    # main entry for pre-train
│   ├── fine-tune.sh                    # main entry for fine-tune
│   └── README.md
├── requirements.txt
└── README.md

Pre-requisite

pip install -r requirements.txt

Pipeline

The pipeline contains two steps: Pre-train and Fine-tune. We first pre-train our model on multiple language pairs jointly. Then we further fine-tune on downstream language pairs.

Preprocess

The preprocess pipeline is composed of the following 4 separate steps:

  • Data filtering and cleaning

  • Tokenization

  • Learn / apply a joint BPE subword vocabulary

  • Random Aligned Substitution (optional, applied to the training set only)

We provide a script to run all the above steps in one command:

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${config_yaml_file}
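To make the first step listed above more concrete, here is a minimal sketch of the kind of filtering a parallel-corpus cleaner typically performs. It is not the logic of the repo's `clean_scripts/`: the function name `clean_parallel` and the thresholds `max_len` and `max_ratio` are assumptions chosen for illustration.

```python
def clean_parallel(pairs, max_len=250, max_ratio=3.0):
    """Filter (source, target) sentence pairs with simple heuristics.

    The thresholds are assumed values for illustration, not the repo's settings.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if src_len == 0 or tgt_len == 0:
            continue  # drop pairs with an empty side
        if src_len > max_len or tgt_len > max_len:
            continue  # drop overly long sentences
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # drop pairs whose lengths differ too much
        if (src, tgt) in seen:
            continue  # drop exact duplicates
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

raw = [
    ("Hello world .", "Hallo Welt ."),
    ("Hello world .", "Hallo Welt ."),          # duplicate
    ("", "Leer"),                                # empty source
    ("Yes", "Ja , das ist absolut richtig ."),   # length ratio 1:7
]
cleaned = clean_parallel(raw)
print(cleaned)
```

Only the first pair survives in this toy input; the other three are removed by the duplicate, empty-side, and length-ratio checks respectively.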

Pre-train

step1: preprocess train data and learn a joint BPE subword vocabulary across all languages.

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train.yml

The command above runs cleaning, subword segmentation, merging, and RAS in sequence. The result is a BPE vocabulary and a RAS-processed multilingual dataset merged from multiple language pairs.

step2: preprocess development data

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/dev.yml

We create a multilingual development set to help choose the best pre-trained checkpoint.

step3: binarize data

bash ${PROJECT_ROOT}/experiments/example/bin_pretrain.sh

step4: pre-train on RASed multilingual corpus

export CUDA_VISIBLE_DEVICES=0,1,2,3 && bash ${PROJECT_ROOT}/train/pre-train.sh ${PROJECT_ROOT}/experiments/example/configs/train/pre-train/transformer_big.yml

You can modify the configs to choose the model architecture or dataset used.

Fine-tune

step1: preprocess train/test data

bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train_en2de.yml
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/test_en2de.yml

The commands above run cleaning and subword segmentation.

step2: binarize data

bash ${PROJECT_ROOT}/experiments/example/bin_finetune.sh

step3: fine-tune on specific language pairs

export CUDA_VISIBLE_DEVICES=0,1,2 && export EVAL_GPU_INDEX=${eval_gpu_index} && bash ${PROJECT_ROOT}/train/fine-tune.sh ${PROJECT_ROOT}/experiments/example/configs/train/fine-tune/en2de_transformer_big.yml ${PROJECT_ROOT}/experiments/example/configs/eval/en2de_eval.yml
  • eval_gpu_index is the index of the GPU on your machine that will be allocated to evaluating the model. Setting it to -1 means the CPU is used for evaluation during training.
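The convention for the evaluation device can be summarized in a few lines. This is a hypothetical helper, not code from the repo (the actual handling lives in its shell scripts); the function name `eval_device`, the CPU default when the variable is unset, and the `cuda:N` label format are assumptions.

```python
import os

def eval_device(env=None):
    """Map EVAL_GPU_INDEX to a device label; -1 selects the CPU.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    index = int(env.get("EVAL_GPU_INDEX", "-1"))  # assumed default: CPU
    return "cpu" if index < 0 else f"cuda:{index}"

print(eval_device({"EVAL_GPU_INDEX": "3"}))   # evaluation on GPU 3
print(eval_device({"EVAL_GPU_INDEX": "-1"}))  # evaluation on CPU
```

In the command above, GPUs 0 to 2 are used for training, so choosing a different index for evaluation (or -1) keeps evaluation from competing with training for memory.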

Multilingual Pre-trained Model

Dataset

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original corpus of 32 language pairs contains about 197M sentence pairs. After applying RAS we obtain about 262M sentence pairs, since we keep both the original and the substituted sentences. We release both the original dataset and the dataset after applying RAS.

Dataset                    #Pair
32-lang-pairs-TRAIN        197603294
32-lang-pairs-RAS-TRAIN    262662792
32-lang-pairs-DEV          156587
Vocab                      -
BPE Code                   -
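The dataset figures above can be checked with a few lines of arithmetic. The sketch uses only the numbers reported in this section; the variable names are illustrative.

```python
language_pairs = 32
directed_pairs = language_pairs * 2  # each English-centric pair is used in both directions

original_pairs = 197_603_294   # 32-lang-pairs-TRAIN
ras_pairs = 262_662_792        # 32-lang-pairs-RAS-TRAIN

added_by_ras = ras_pairs - original_pairs
growth = ras_pairs / original_pairs

print(directed_pairs)          # number of directed translation pairs
print(added_by_ras)            # sentence pairs added on top of the original corpus
print(f"{growth:.2f}x")        # size of the RAS corpus relative to the original
```

The RAS training set is roughly a third larger than the original corpus.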

Checkpoints

We release checkpoints trained on 32-lang-pairs and 32-lang-pairs-RAS. We also extend our model to 58 language pairs.

Dataset             Checkpoint
Baseline-w/o-RAS    mTransformer-6enc6dec
mRASP-PC32          mRASP-PC32-6enc6dec
mRASP-PC58          -

Fine-tuning Model

We release En-Ro, En2De and En2Fr benchmark checkpoints and the corresponding configs.

Lang-Pair   Datasource                 Checkpoints   Configs        tok-BLEU   detok-BLEU
En2Ro       WMT16 En-Ro, dev, test     en2ro         en2ro_config   39.0       37.6
Ro2En       WMT16 Ro-En, dev, test     ro2en         ro2en_config   37.7       36.9
En2De       WMT16 En-De, newstest16    en2de         en2de_config   30.3       -
En2Fr       WMT14 En-Fr, newstest14    en2fr         en2fr_config   44.3       -

Comparison with mBART

mBART is a pre-trained model trained on large-scale multilingual corpora. To show how mRASP compares, we fine-tune on language pairs with different amounts of training data and evaluate on the same test sets as mBART.

Lang-pairs  Size   Direction  Datasource   Testset     Checkpoint  mBART  mRASP
En-Gu       10K    En→Gu      en_gu_train  newstest19  en2gu       0.1    3.2
                   Gu→En      en_gu_train  newstest19  gu2en       0.3    0.6
En-Kk       128K   En→Kk      en_kk_train  newstest19  en2kk       2.5    8.2
                   Kk→En      en_kk_train  newstest19  kk2en       7.4    12.3
En-Tr       388K   En→Tr      en_tr_train  newstest17  en2tr       17.8   20.0
                   Tr→En      en_tr_train  newstest17  tr2en       22.5   23.4
En-Et       2.3M   En→Et      en_et_train  newstest18  en2et       21.4   20.9
                   Et→En      en_et_train  newstest18  et2en       27.8   26.8
En-Fi       4M     En→Fi      en_fi_train  newstest17  en2fi       22.4   24.0
                   Fi→En      en_fi_train  newstest17  fi2en       28.5   28.0
En-Lv       5.5M   En→Lv      en_lv_train  newstest17  en2lv       15.9   21.6
                   Lv→En      en_lv_train  newstest17  lv2en       19.3   24.4
En-Cs       978K   En→Cs      en_cs_train  newstest19  en2cs       18.0   19.9
En-De       4.5M   En→De      en_de_train  newstest19  en2de       30.5   35.2
En-Fr       40M    En→Fr      en_fr_train  newstest14  en2fr       41.0   44.3
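To summarize the comparison above, the following sketch computes the per-direction BLEU difference between mRASP and mBART. The scores are copied from the table; the variable names and the choice of summary statistics are illustrative.

```python
# (checkpoint, mBART BLEU, mRASP BLEU), copied from the comparison table.
results = [
    ("en2gu", 0.1, 3.2), ("gu2en", 0.3, 0.6),
    ("en2kk", 2.5, 8.2), ("kk2en", 7.4, 12.3),
    ("en2tr", 17.8, 20.0), ("tr2en", 22.5, 23.4),
    ("en2et", 21.4, 20.9), ("et2en", 27.8, 26.8),
    ("en2fi", 22.4, 24.0), ("fi2en", 28.5, 28.0),
    ("en2lv", 15.9, 21.6), ("lv2en", 19.3, 24.4),
    ("en2cs", 18.0, 19.9), ("en2de", 30.5, 35.2), ("en2fr", 41.0, 44.3),
]

# Positive delta means mRASP scores higher than mBART on that direction.
deltas = {name: round(mrasp - mbart, 1) for name, mbart, mrasp in results}
wins = [name for name, delta in deltas.items() if delta > 0]
losses = [name for name, delta in deltas.items() if delta < 0]
mean_delta = sum(deltas.values()) / len(deltas)

print(f"mRASP higher on {len(wins)} of {len(deltas)} directions")
print(f"mBART higher on: {', '.join(losses)}")
print(f"mean BLEU difference: {mean_delta:+.2f}")
```

mRASP scores higher on 12 of the 15 directions, while mBART is ahead on en2et, et2en, and fi2en.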

Citation

If you are interested in mRASP, please consider citing our paper:

@inproceedings{lin-etal-2020-pre,
    title = "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information",
    author = "Lin, Zehui  and
      Pan, Xiao  and
      Wang, Mingxuan  and
      Qiu, Xipeng  and
      Feng, Jiangtao  and
      Zhou, Hao  and
      Li, Lei",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.210",
    pages = "2649--2663",
}