Joint Source-Target Self Attention with Locality Constraints

Joint Source-Target Self Attention with Locality Constraints (Fonollosa et al., 2019)

This repository contains the source code and pre-trained models, as well as instructions to reproduce the results of our paper.

Citation:

@article{fonollosa2019joint,
  title={Joint Source-Target Self Attention with Locality Constraints},
  author={Jos\'e A. R. Fonollosa and Noe Casas and Marta R. Costa-juss\`a},
  journal={arXiv preprint arXiv:1905.06596},
  url={http://arxiv.org/abs/1905.06596},
  year={2019}
}

Setup

Requirements

  • PyTorch version >= 1.0.0
  • fairseq version >= 0.6.2
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
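As a quick sanity check (not part of the original instructions), you can confirm the PyTorch and Python versions, and that a GPU is visible, before going further:

python -c "import torch; print(torch.__version__)"   # expect >= 1.0.0
python --version                                      # expect >= 3.6
nvidia-smi                                            # GPUs/NCCL only matter for training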

Install fairseq from source

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable .
cd ..
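The requirements list fairseq >= 0.6.2; if the current fairseq master has drifted too far from that API, installing the 0.6.2 release instead is a reasonable fallback (a suggestion, not part of the original instructions):

# Alternative: pin fairseq to the 0.6.2 release from PyPI instead of installing from source
pip install fairseq==0.6.2
python -c "import fairseq; print(fairseq.__version__)"   # expect >= 0.6.2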

Clone this repository

git clone https://github.com/jarfo/joint.git
cd joint

Translation

Pre-trained models

Dataset                   Model                                       Prepared test set
IWSLT14 German-English    download (.pt)                              IWSLT14 test: download (.tgz)
WMT16 English-German      download (.bz2)                             newstest2014 (shared vocab): download (.tgz)
WMT14 English-French      download (split #1), download (split #2)    newstest2014 (shared vocab): download (.tgz)

The English-French model is split into two files, which can be joined with:

cat local_joint_attention_wmt_en_fr_big.pt_* > local_joint_attention_wmt_en_fr_big.pt
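As a usage sketch (the test-set directory name below is a placeholder for wherever you extract the prepared newstest2014 archive; this step is not part of the original instructions), a downloaded checkpoint can be evaluated directly with fairseq-generate:

# Hypothetical path: replace newstest2014_en_fr_bin with the extracted test-set directory
fairseq-generate newstest2014_en_fr_bin --user-dir models \
    --path local_joint_attention_wmt_en_fr_big.pt \
    --batch-size 32 --beam 5 --remove-bpe --lenpen 0.9 --gen-subset test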

IWSLT14 De-En

The IWSLT'14 German-to-English dataset ("Report on the 11th IWSLT evaluation campaign", Cettolo et al., 2014) is tokenized and segmented with a joint BPE vocabulary of 31K tokens.

# Dataset download and preparation
cd examples
./prepare-iwslt14-31K.sh
cd ..

# Dataset binarization:
TEXT=examples/iwslt14.tokenized.31K.de-en
fairseq-preprocess --joined-dictionary --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.joined-dictionary.31K.de-en
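Before training, you can optionally confirm (this check is not part of the original recipe) that the binarized splits and the joined dictionary were written to the destination directory:

ls data-bin/iwslt14.joined-dictionary.31K.de-en
# expect dict.de.txt and dict.en.txt (identical, due to --joined-dictionary),
# plus .bin and .idx files for the train/valid/test de-en splits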

Training and evaluating Local Joint Attention on a GPU:

# Training
SAVE="checkpoints/local_joint_attention_iwslt_de_en"
mkdir -p $SAVE

fairseq-train data-bin/iwslt14.joined-dictionary.31K.de-en \
    --user-dir models \
    --arch local_joint_attention_iwslt_de_en \
    --clip-norm 0 --optimizer adam --lr 0.001 \
    --source-lang de --target-lang en --max-tokens 4000 --no-progress-bar \
    --log-interval 100 --min-lr '1e-09' --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr-scheduler inverse_sqrt \
    --ddp-backend=no_c10d \
    --max-update 85000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --keep-last-epochs 10 \
    --share-all-embeddings \
    --save-dir $SAVE

python scripts/average_checkpoints.py --inputs $SAVE \
    --num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"

# Evaluation
fairseq-generate data-bin/iwslt14.joined-dictionary.31K.de-en --user-dir models \
    --path "${SAVE}/checkpoint_last10_avg.pt" \
    --batch-size 32 --beam 5 --remove-bpe --lenpen 1.7 --gen-subset test --quiet
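Beyond scoring the test set, the averaged checkpoint can also translate new sentences with fairseq-interactive. This is a sketch, not part of the original recipe: the input must already be tokenized and BPE-encoded with the same joint 31K codes produced by prepare-iwslt14-31K.sh.

# Input is read from stdin; it must be pre-tokenized and BPE-encoded
echo "das ist ein test ." | fairseq-interactive data-bin/iwslt14.joined-dictionary.31K.de-en \
    --user-dir models --source-lang de --target-lang en \
    --path "${SAVE}/checkpoint_last10_avg.pt" --beam 5 --remove-bpe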

WMT16 En-De

Training Local Joint Attention on WMT16 En-De using the cosine scheduler on one machine with 8 NVIDIA V100-16GB GPUs:

Download the preprocessed WMT'16 En-De data provided by Google, then extract it:

TEXT=wmt16_en_de_bpe32k
mkdir $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT

Preprocess the dataset with a joined dictionary:

fairseq-preprocess --source-lang en --target-lang de \
  --trainpref $TEXT/train.tok.clean.bpe.32000 \
  --validpref $TEXT/newstest2013.tok.bpe.32000 \
  --testpref $TEXT/newstest2014.tok.bpe.32000 \
  --destdir data-bin/wmt16_en_de_bpe32k \
  --nwordssrc 32768 --nwordstgt 32768 \
  --joined-dictionary

Train a model

# Training
SAVE="save/joint_attention_wmt_en_de_big"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 fairseq-train \
    data-bin/wmt16_en_de_bpe32k \
    --user-dir models \
    --arch local_joint_attention_wmt_en_de_big \
    --fp16 --log-interval 100 --no-progress-bar \
    --max-update 30000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --min-lr 1e-09 --update-freq 32 --keep-last-epochs 10 \
    --ddp-backend=no_c10d --max-tokens 1800 \
    --lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
    --lr-shrink 1 --max-lr 0.0009 --lr 1e-7 \
    --t-mult 1 --lr-period-updates 20000 \
    --save-dir $SAVE

# Checkpoint averaging
python ../fairseq/scripts/average_checkpoints.py --inputs $SAVE \
    --num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"

# Evaluation on newstest2014
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt16_en_de_bpe32k --user-dir models \
    --path "${SAVE}/checkpoint_last10_avg.pt" \
    --batch-size 32 --beam 5 --remove-bpe --lenpen 0.35 --gen-subset test > wmt16_gen.txt
bash ../fairseq/scripts/compound_split_bleu.sh wmt16_gen.txt

WMT14 En-Fr

Training and evaluating Local Joint Attention on WMT14 En-Fr using the cosine scheduler on one machine with 8 NVIDIA V100-16GB GPUs:

# Data preparation
cd examples
bash prepare-wmt14en2fr.sh
cd ..

# Binarize the dataset:
TEXT=examples/wmt14_en_fr
fairseq-preprocess --joined-dictionary --source-lang en --target-lang fr \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_fr --thresholdtgt 0 --thresholdsrc 0

# Training
SAVE="save/dynamic_conv_wmt14en2fr"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 fairseq-train \
    data-bin/wmt14_en_fr \
    --user-dir models \
    --arch local_joint_attention_wmt_en_fr_big \
    --fp16 --log-interval 100 --no-progress-bar \
    --max-update 80000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --min-lr 1e-09 --update-freq 32 --keep-last-epochs 10 \
    --ddp-backend=no_c10d --max-tokens 1800 \
    --lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
    --lr-shrink 1 --max-lr 0.0005 --lr 1e-7 \
    --t-mult 1 --lr-period-updates 70000 \
    --save-dir $SAVE

# Checkpoint averaging
python ../fairseq/scripts/average_checkpoints.py --inputs $SAVE \
    --num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"

# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt14_en_fr --user-dir models \
    --path "${SAVE}/checkpoint_last10_avg.pt" \
    --batch-size 128 --beam 5 --remove-bpe --lenpen 0.9 --gen-subset test