/convtransformer

Code for the ACL2020 paper Character-Level Translation with Self-Attention

Primary LanguagePython

Character-Level Translation with Self-attention

Code for the paper Character-Level Translation with Self-attention, accepted at ACL 2020.

Corpora and experiments

We test our model on two corpora:

Preparations

Download our code and install our version of Fairseq

We use fairseq (Ott et al. in 2019) as a base to implement our model. To install our fairseq snapshot, run the following commands:

git clone https://github.com/CharizardAcademy/convtransformer.git
pip install -r requirements.txt # install dependencies
python setup.py build # build fairseq
python setup.py develop

To make Fairseq work on the character-level, we modify tokenizer.py here.

Data preprocessing

We use Moses (Koehn et al. in 2007) to clean and tokenize the data, by appying the following scripts:

mosesdecoder/scripts/tokenizer/remove-non-printing-char.perl
mosesdecoder/scripts/tokenizer/tokenizer.perl
mosesdecoder/scripts/training/clean-corpus-n.perl

Converting Chinese texts to Wubi texts

To convert a text of raw Chinese characters into a text of corresponding Wubi codes, run the following commands:

cd convtransformer/

python convert_text.py --input-doc path/to/the/chinese/text --output-doc path/to/the/wubi/text --convert-type ch2wb

The convert_text.py is available at https://github.com/duguyue100/wmt-en2wubi.

Bilingual Training Data

To construct training sets for bilingual translation, run the following commands (example for UNPC French - English):

cd UN-corpora/
cd ./en-fr

paste -d'|' UNv1.0.en-fr.fr UNv1.0.en-fr.en | cat -n |shuf -n 1000000 | sort -n | cut -f2 > train.parallel.fr-en


cut -d'|' -f1 train.parallel.fr-en > 1mil.train.fr-en.fr
cut -d'|' -f2 train.parallel.fr-en > 1mil.train.fr-en.en

Multilingual Training Data

To construct training sets for multilingual translation, run the following commands (example for UNPC French + Spanish - English):

cat train.parallel.fr-en train.parallel.es-en > concat.train.parallel.fres-en

shuf concat.train.parallel.fres-en > shuffled.train.parallel.fres-en

cut -d'|' -f1 shuffled.train.parallel.fres-en > 2mil.train.fres-en.fres
cut -d'|' -f2 shuffled.train.parallel.fres-en > 2mil.train.fres-en.en

Data Binarization

The next step is binarize the data. Example for UNPC French + Spanish - English:

mkdir UN-bin/multilingual/fres-en/test-fr/
mkdir UN-bin/multilingual/fres-en/test-es/

cd convtransformer/

evaluation on French input

python preprocess.py --source-lang fres --target-lang en \
--trainpref UN-processed/multilingual/fres-en/test-fr/2mil.train.fres-en/ \
--validpref UN-processed/multilingual/fres-en/test-fr/2mil.valid.fres-en/ \
--testpref UN-processed/multilingual/fres-en/test-fr/2mil.test.fres-en.fr/ \
--destdir UN-bin/multilingual/fres-en/test-fr/ \ 
--nwordssrc 10000 --nwordstgt 10000 

evaluation on Spanish input

python preprocess.py --source-lang fres --targe-lang en \
--trainpref UN-processed/multilingual/fres-en/test-es/2mil.train.fres-en/ \
--validpref UN-processed/multilingual/fres-en/test-es/2mil.valid.fres-en/ \
--testpref UN-processed/multilingual/fres-en/test-es/2mil.test.fres-en.es/ \
--destdir UN-bin/multilingual/fres-en/test-es/ \
--nwordssrc 10000 --nwordstgt 10000

Convtransformer model

The model is implemented here.

Training

We train our models on 4 NVIDIA 1080x GPUs, using Adam:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py UN-bin/multilingual/fres-en/test-es/ \
--arch convtransformer --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0001 \
--min-lr 1e-09 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --max-tokens 3000  \
--save-dir checkpoints-conv-multi-fres-en/ \
--no-progress-bar --log-format simple --log-interval 2000 \
--find-unused-parameters --ddp-backend=no_c10d

where --ddp-backend=no_c10d and --find-unused-parameters are crucial arguments to train the convtransformer model. You should change CUDA_VISIBLE_DEVICES according to the hardware you have available.

Inference

We compute BLEU using Moses.

Evaluation on test set

As an example, to evaluate the test set, run conv-multi-fres-en.sh to generate translation files of each individual checkpoint. To compute the BLEU score of one translation file, run:

cd geneations/conv-multi-fres-en/
cd ./test-fr/

bash geneation_split.sh

rm -f generation_split.sh.sys generation_split.sh.ref 

mkdir split

mv generate*.out.sys ./split/
mv generate*.out.ref ./split/

cd ./split/

perl multi-bleu.perl generate30.out.ref < generate30.out.sys

Evaluation on with manual input

To generate translation by manually inputting the sentence, run:

cd convtransformer/

python interactive.py -source_sentence "Violación: uso de cloro gaseoso por el régimen sirio." \ 
-path_checkpoint "checkpoints-conv-multi-fres-en/checkpoint30.pt" \
-data_bin "UN-bin/multilingual/fres-en/test-es/"

This will print out the translated sentence in the terminal.

Analysis

Canonical Correlation Analysis

We compute the correlation coefficients with the CCA algorithm using the encoder-decoder attention matrix from the 6.th last model layer.

An an example, to obtain the attention matrices, run:

cd convtransformer/ 

bash attn_matrix.sh

To compute the correlation coefficients, run:

python cca.py -path_X "/bilingual/attention/matrix/" -path_Y "/multilingual/attention/matrix/"

Citation

@inproceedings{gao2020character,
  title={Character-level {T}ranslation with {S}elf-attention},
  author={Yingqiang Gao and Nikola I. Nikolov and Yuhuang Hu and Richard H.R. Hahnloser},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  publisher = "Association for Computational Linguistics",
  year={2020}
}