This repository contains the source code, pre-trained models, and instructions to reproduce the results of our paper:
@article{fonollosa2019joint,
title={Joint Source-Target Self Attention with Locality Constraints},
author={Jos\'e A. R. Fonollosa and Noe Casas and Marta R. Costa-juss\`a},
journal={arXiv preprint arXiv:1905.06596},
url={http://arxiv.org/abs/1905.06596},
year={2019}
}
- PyTorch version >= 1.0.0
- fairseq version >= 0.6.2
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
cd ..
git clone https://github.com/jarfo/joint.git
cd joint
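To quickly verify the environment before training, you can check the installed versions from Python (both packages expose a `__version__` attribute):

# Optional sanity check of the installed versions
python -c "import torch; print('torch', torch.__version__)"
python -c "import fairseq; print('fairseq', fairseq.__version__)"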
Dataset | Model | Prepared test set |
---|---|---|
IWSLT14 German-English | download (.pt) | IWSLT14 test: download (.tgz) |
WMT16 English-German | download (.bz2) | newstest2014 (shared vocab): download (.tgz) |
WMT14 English-French | download (split #1) download (split #2) | newstest2014 (shared vocab): download (.tgz) |
The English-French model download is split into two files that can be joined with:
cat local_joint_attention_wmt_en_fr_big.pt_* > local_joint_attention_wmt_en_fr_big.pt
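Once joined, the checkpoint plugs directly into `fairseq-generate`. A minimal sketch, assuming the prepared newstest2014 archive from the table above has been extracted to `data-bin/wmt14_en_fr` (the directory name here is illustrative):

# Illustrative evaluation of the downloaded En-Fr checkpoint
fairseq-generate data-bin/wmt14_en_fr --user-dir models \
    --path local_joint_attention_wmt_en_fr_big.pt \
    --batch-size 64 --beam 5 --remove-bpe --lenpen 0.9 --gen-subset test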
The IWSLT'14 German-to-English translation dataset ("Report on the 11th IWSLT evaluation campaign", Cettolo et al.) is tokenized with a joint BPE vocabulary of 31K tokens.
# Dataset download and preparation
cd examples
./prepare-iwslt14-31K.sh
cd ..
# Dataset binarization:
TEXT=examples/iwslt14.tokenized.31K.de-en
fairseq-preprocess --joined-dictionary --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.joined-dictionary.31K.de-en
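As an optional sanity check, the joined dictionary written by `fairseq-preprocess` should contain roughly 31K BPE types (the exact line count differs slightly because special symbols are added at load time):

# With --joined-dictionary, dict.de.txt and dict.en.txt are identical
wc -l data-bin/iwslt14.joined-dictionary.31K.de-en/dict.de.txt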
Training and evaluating Local Joint Attention on a GPU:
# Training
SAVE="checkpoints/local_joint_attention_iwslt_de_en"
mkdir -p $SAVE
fairseq-train data-bin/iwslt14.joined-dictionary.31K.de-en \
--user-dir models \
--arch local_joint_attention_iwslt_de_en \
--clip-norm 0 --optimizer adam --lr 0.001 \
--source-lang de --target-lang en --max-tokens 4000 --no-progress-bar \
--log-interval 100 --min-lr '1e-09' --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--lr-scheduler inverse_sqrt \
--ddp-backend=no_c10d \
--max-update 85000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --keep-last-epochs 10 \
--share-all-embeddings \
--save-dir $SAVE
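# Checkpoint averaging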
python scripts/average_checkpoints.py --inputs $SAVE \
--num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"
# Evaluation
fairseq-generate data-bin/iwslt14.joined-dictionary.31K.de-en --user-dir models \
--path "${SAVE}/checkpoint_last10_avg.pt" \
--batch-size 32 --beam 5 --remove-bpe --lenpen 1.7 --gen-subset test --quiet
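The averaged checkpoint can also be used for interactive decoding; note that input sentences must be tokenized and BPE-encoded exactly as in the prepared training data. A sketch:

fairseq-interactive data-bin/iwslt14.joined-dictionary.31K.de-en --user-dir models \
    --path "${SAVE}/checkpoint_last10_avg.pt" --beam 5 --remove-bpe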
Training Local Joint Attention on WMT16 En-De using the cosine scheduler on one machine with 8 NVIDIA V100-16GB GPUs:
Download the preprocessed WMT'16 En-De data provided by Google, then extract it:
$ TEXT=wmt16_en_de_bpe32k
$ mkdir $TEXT
$ tar -xzvf wmt16_en_de.tar.gz -C $TEXT
Preprocess the dataset with a joined dictionary:
$ fairseq-preprocess --source-lang en --target-lang de \
--trainpref $TEXT/train.tok.clean.bpe.32000 \
--validpref $TEXT/newstest2013.tok.bpe.32000 \
--testpref $TEXT/newstest2014.tok.bpe.32000 \
--destdir data-bin/wmt16_en_de_bpe32k \
--nwordssrc 32768 --nwordstgt 32768 \
--joined-dictionary
Train a model:
# Training
SAVE="save/joint_attention_wmt_en_de_big"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 $(which fairseq-train) \
data-bin/wmt16_en_de_bpe32k \
--user-dir models \
--arch local_joint_attention_wmt_en_de_big \
--fp16 --log-interval 100 --no-progress-bar \
--max-update 30000 --share-all-embeddings --optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--min-lr 1e-09 --update-freq 32 --keep-last-epochs 10 \
--ddp-backend=no_c10d --max-tokens 1800 \
--lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
--lr-shrink 1 --max-lr 0.0009 --lr 1e-7 \
--t-mult 1 --lr-period-updates 20000 \
--save-dir $SAVE
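With `--max-tokens 1800` per GPU, `--update-freq 32` gradient-accumulation steps, and 8 GPUs, each optimizer update sees at most about 8 × 32 × 1800 ≈ 460K tokens; if you train on fewer GPUs, increase `--update-freq` proportionally to keep a comparable effective batch size.

# Approximate upper bound on tokens per optimizer update
echo $((8 * 32 * 1800))   # 460800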
# Checkpoint averaging
python ../fairseq/scripts/average_checkpoints.py --inputs $SAVE \
--num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"
# Evaluation on newstest2014
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt16_en_de_bpe32k --user-dir models \
--path "${SAVE}/checkpoint_last10_avg.pt" \
--batch-size 32 --beam 5 --remove-bpe --lenpen 0.35 --gen-subset test > wmt16_gen.txt
bash ../fairseq/scripts/compound_split_bleu.sh wmt16_gen.txt
Training and evaluating Local Joint Attention on WMT14 En-Fr using the cosine scheduler on one machine with 8 NVIDIA V100-16GB GPUs:
# Data preparation
$ cd examples
$ bash prepare-wmt14en2fr.sh
$ cd ..
# Binarize the dataset:
$ TEXT=examples/wmt14_en_fr
$ fairseq-preprocess --joined-dictionary --source-lang en --target-lang fr \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0
# Training
SAVE="save/dynamic_conv_wmt14en2fr"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 $(which fairseq-train) \
data-bin/wmt14_en_fr \
--user-dir models \
--arch local_joint_attention_wmt_en_fr_big \
--fp16 --log-interval 100 --no-progress-bar \
--max-update 80000 --share-all-embeddings --optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--min-lr 1e-09 --update-freq 32 --keep-last-epochs 10 \
--ddp-backend=no_c10d --max-tokens 1800 \
--lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
--lr-shrink 1 --max-lr 0.0005 --lr 1e-7 \
--t-mult 1 --lr-period-updates 70000 \
--save-dir $SAVE
# Checkpoint averaging
python ../fairseq/scripts/average_checkpoints.py --inputs $SAVE \
--num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"
# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt14_en_fr --user-dir models \
--path "${SAVE}/checkpoint_best.pt" --batch-size 128 --beam 5 --remove-bpe --lenpen 0.9 --gen-subset test