All models are trained with knowledge distillation; see our paper for details.
| Pretrained Models | | | |
|---|---|---|---|
| WMT14 English-German | WMT14 German-English | WMT16 English-Romanian | WMT16 Romanian-English |
| WMT17 English-Chinese | WMT17 Chinese-English | WMT14 English-French | |
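The checkpoints can be fetched and unpacked with standard tools. The sketch below is hypothetical: `MODEL_URL` and the archive name are placeholders, assuming each entry above points to a tar archive containing the model directory and dictionaries.

```bash
# Sketch only: MODEL_URL and the archive name are placeholders; substitute
# the download link for your language pair from the table above.
MODEL_URL=LINK_FROM_TABLE
wget ${MODEL_URL} -O maskPredict_en_de.tar.gz
tar -xzvf maskPredict_en_de.tar.gz   # unpacks the dictionaries and checkpoint
```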
We also provide our knowledge distillation data for WMT14 EN-DE.
Preprocess the data with `preprocess.py`, reusing the vocabulary files that ship with the downloaded model directory:

```bash
text=PATH_YOUR_DATA
output_dir=PATH_YOUR_OUTPUT
src=source_language
tgt=target_language
model_path=PATH_TO_MASKPREDICT_MODEL_DIR

python preprocess.py --source-lang ${src} --target-lang ${tgt} --trainpref $text/train \
    --validpref $text/valid --testpref $text/test --destdir ${output_dir}/data-bin \
    --workers 60 --srcdict ${model_path}/maskPredict_${src}_${tgt}/dict.${src}.txt \
    --tgtdict ${model_path}/maskPredict_${src}_${tgt}/dict.${tgt}.txt
```
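For concreteness, a hypothetical configuration for WMT14 English-German might look like this; every path below is a placeholder, not a file the repository ships:

```bash
# Hypothetical example for WMT14 English-German; all paths are placeholders.
text=data/wmt14_ende_distill   # directory holding train/valid/test BPE files
output_dir=out/wmt14_ende
src=en
tgt=de
model_path=models              # contains maskPredict_en_de/dict.{en,de}.txt
```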
Train a DisCo transformer on the binarized data:

```bash
model_dir=PLACE_TO_SAVE_YOUR_MODEL

python train.py ${output_dir}/data-bin --arch disco_transformer \
    --criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 --lr 5e-4 \
    --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 --task translation_self \
    --max-tokens 8192 --weight-decay 0.01 --dropout 0.2 --encoder-layers 6 --encoder-embed-dim 512 \
    --decoder-layers 6 --decoder-embed-dim 512 --fp16 --max-source-positions 10000 \
    --max-target-positions 10000 --max-update 300000 --seed 1 \
    --save-dir ${model_dir} --dynamic-masking --ignore-eos-loss --share-all-embeddings
```
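Before decoding, checkpoint averaging often helps translation models at test time. The sketch below assumes fairseq's `scripts/average_checkpoints.py` is available in this repository; the output file name is a placeholder.

```bash
# Optional sketch: average the last 5 epoch checkpoints. Assumes fairseq's
# scripts/average_checkpoints.py; checkpoint_avg.pt is a placeholder name.
python scripts/average_checkpoints.py \
    --inputs ${model_dir} \
    --num-epoch-checkpoints 5 \
    --output ${model_dir}/checkpoint_avg.pt
```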
We provide two inference methods:
- Parallel Easy-First (Kasai et al., 2020)
```bash
python generate_disco.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt \
    --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 \
    --decoding-strategy easy_first --length-beam 5
```
- Mask-Predict (Ghazvininejad et al., 2019)
```bash
python generate_disco.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt \
    --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 \
    --decoding-strategy mask_predict --length-beam 5
```
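To keep the raw hypotheses for later scoring or analysis, you can redirect the generation log and filter it. This is a sketch assuming fairseq-style output, where each hypothesis line starts with `H-` and the text is the third tab-separated field; `gen.out` and `gen.hyp` are hypothetical file names.

```bash
# Sketch: capture the generation log and extract plain-text hypotheses.
# Assumes fairseq-style logs ("H-<id>\t<score>\t<text>"); file names are placeholders.
python generate_disco.py ${output_dir}/data-bin --path ${model_dir}/checkpoint_best.pt \
    --task translation_self --remove-bpe --max-sentences 20 --decoding-iterations 10 \
    --decoding-strategy mask_predict --length-beam 5 > gen.out
grep ^H gen.out | cut -f3- > gen.hyp   # hypotheses only, one per line
```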
DisCo is released under the CC-BY-NC 4.0 license, which also applies to the trained models.
Please cite as:
```bibtex
@inproceedings{Kasai2020DisCo,
  title = {Non-autoregressive Machine Translation with Disentangled Context Transformer},
  author = {Jungo Kasai and James Cross and Marjan Ghazvininejad and Jiatao Gu},
  booktitle = {Proc. of ICML},
  year = {2020},
  url = {https://arxiv.org/abs/2001.05136},
}
```
This code is based heavily on the original Mask-Predict implementation.