Source code for our EMNLP 2021 paper "Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy" [PDF]
Our method is implemented based on the open-source toolkit Fairseq.
-
Python version = 3.6
-
PyTorch version = 1.7
-
Install fairseq:
git clone https://github.com/ictnlp/MoE-Waitk.git cd MoE-Waitk pip install --editable ./
We use the data of IWSLT15 English-Vietnamese (download here) and WMT15 German-English (download here), and apply BPE with 32K merge operations on WMT15 German-English via subword_nmt/apply_bpe.py.
Then, we process the data into the fairseq format:
src=SOURCE_LANGUAGE
tgt=TARGET_LANGUAGE
train_data=PATH_TO_TRAIN_DATA
vaild_data=PATH_TO_VALID_DATA
test_data=PATH_TO_TEST_DATA
data=PATH_TO_DATA
fairseq-preprocess --source-lang ${src} --target-lang ${tgt} \
--trainpref ${train_data} --validpref ${vaild_data} \
--testpref ${test_data}\
--destdir ${data} \
--workers 20
Train MoE Wait-k Policy in two stage, according to the following command:
- For Transformer-Small with 4 attention heads: we set expert lagging = 1,6,11,16
- For Transformer-Base with 8 attention heads: we set expert lagging = 1,3,5,7,9,11,13,15
- For Transformer-Big with 16 attention heads: we set expert lagging = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
- First-stage: fix the expert weights equal, and pre-train expert parameters.
export CUDA_VISIBLE_DEVICES=0,1,2,3
data=PATH_TO_DATA
modelfile=PATH_TO_SAVE_MODEL
expert_lagging=SET_EXPERT_LAGGING #1,3,5,7,9,11,13,15
# Fisrt-stage: Pertrain an equal-weight MoE Wait-k
python train.py --ddp-backend=no_c10d ${data} --arch transformer --share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--dropout 0.3 \
--criterion label_smoothed_cross_entropy \
--reset-dataloader --reset-lr-scheduler --reset-optimizer\
--label-smoothing 0.1 \
--encoder-attention-heads 8 \
--decoder-attention-heads 8 \
--left-pad-source False \
--fp16 \
--equal-weight \
--expert-lagging ${expert_lagging} \
--save-dir ${modelfile} \
--max-tokens 4096 --update-freq 2
- Second-stage: jointly fine-tune the parameters of experts and their weights.
# Sencond-stage: Finetune MoE Wait-k with various expert weights
python train.py --ddp-backend=no_c10d ${data} --arch transformer --share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--dropout 0.3 \
--criterion label_smoothed_cross_entropy \
--reset-dataloader --reset-lr-scheduler --reset-optimizer\
--label-smoothing 0.1 \
--encoder-attention-heads 8 \
--decoder-attention-heads 8 \
--left-pad-source False \
--fp16 \
--expert-lagging ${expert_lagging} \
--save-dir ${modelfile} \
--max-tokens 4096 --update-freq 2
Evaluate the model with the following command:
export CUDA_VISIBLE_DEVICES=0
data=PATH_TO_DATA
modelfile = PATH_TO_SAVE_MODEL
ref_dir=PATH_TO_REFERENCE
testk=TEST_WAIT_K
# average last 5 checkpoints
python scripts/average_checkpoints.py --inputs ${modelfile} --num-update-checkpoints 5 --output ${modelfile}/average-model.pt
# generate translation
python generate.py ${data} --path $modelfile/average-model.pt --batch-size 250 --beam 1 --left-pad-source False --fp16 --remove-bpe --test-wait-k ${testk} > pred.out
grep ^H pred.out | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > pred.translation
multi-bleu.perl -lc ${ref_dir} < pred.translation
In this repository is useful for you, please cite as:
@inproceedings{zhang-feng-2021-universal,
title = "Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy",
author = "Zhang, Shaolei and
Feng, Yang",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.581",
doi = "10.18653/v1/2021.emnlp-main.581",
pages = "7306--7317",
}