Theano implementation of the models described in the paper Fully Character-Level Neural Machine Translation without Explicit Segmentation.
We present code for training and decoding four different models:
- bilingual bpe2char (from Chung et al., 2016).
- bilingual char2char
- multilingual bpe2char
- multilingual char2char
Dependencies:
- Theano
- Numpy
- NLTK
- CUDA (we recommend using the latest version; version 8.0 was used in all our experiments)
- For preprocessing and evaluation, we used scripts from MOSES.
- This code is based on Subword-NMT and dl4mt-cdec.
The original WMT'15 corpora can be downloaded from the WMT'15 website. For the preprocessed corpora used in our experiments, see below.
- WMT'15 preprocessed corpora
To obtain the pre-trained top-performing models, see below.
- Pre-trained models (6.0GB): Tarball updated on Nov 21st 2016. The CS-EN bi-char2char model in the previous tarball was not the best-performing model.
Do the following before executing train*.py:
$ export THEANO_FLAGS=device=gpu,floatX=float32
If space permits on your GPU, using cnmem may speed up training:
$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
On a pre-2016 Titan X GPU with 12GB RAM, our bpe2char models were trained with cnmem. Our char2char models (both bilingual and multilingual) were trained without cnmem (due to lack of RAM).
Before executing the following, modify train*.py such that the correct directory containing the WMT'15 corpora is referenced.
$ python bpe2char/train_bi_bpe2char.py -translate <LANGUAGE_PAIR>    # bilingual bpe2char
$ python char2char/train_bi_char2char.py -translate <LANGUAGE_PAIR>  # bilingual char2char
$ python bpe2char/train_multi_bpe2char.py                            # multilingual bpe2char
$ python char2char/train_multi_char2char.py                          # multilingual char2char
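For example, training a bilingual char2char model on German-to-English might look as follows (the de_en value is an assumption; check train_bi_char2char.py for the accepted language-pair tags):

$ python char2char/train_bi_char2char.py -translate de_en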
To resume training a model from a checkpoint, simply append -re_load and -re_load_old_setting to the commands above. Make sure the checkpoint resides in the correct directory (.../dl4mt-c2c/models).
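For example, resuming the bilingual char2char run above from its latest checkpoint might look like this (again, de_en is an assumed language-pair tag):

$ python char2char/train_bi_char2char.py -translate de_en -re_load -re_load_old_setting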
To train models on your own dataset (rather than the WMT'15 corpus), you first need to build your vocabulary using build_dictionary_char.py (for char2char models) or build_dictionary_word.py (for bpe2char models). For the bpe2char model, you additionally need to learn your BPE segmentation rules on the source corpus using the Subword-NMT repository (see below).
Before executing the following, modify translate*.py such that the correct directory containing the WMT'15 corpora is referenced.
$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
$ python translate/translate_bpe2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for bpe2char models
$ python translate/translate_char2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for char2char models
When choosing which pre-trained model to give to -model, make sure to choose one with .grads in its name, e.g. .grads.123000.npz. The models with .grads in their names are the optimal models; you should be decoding from those.
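For instance, decoding the test set with a bilingual char2char model might look like this (the model filename, language pair, and the value passed to -which are placeholders/assumptions):

$ python translate/translate_char2char.py -model models/bi-char2char.grads.123000.npz -translate de_en -saveto output.txt -which test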
To decode your own source file instead, remove -which <VALID/TEST_SET> and append -source <PATH_TO_SOURCE> (see the sketch after the checklist below).
If you choose to decode your own source file, make sure it is:
- properly tokenized (using preprocess/preprocess.sh),
- BPE-tokenized for bpe2char models,
- converted from Cyrillic to Latin characters for multilingual models.
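Decoding your own (tokenized) source file might then look like the following sketch, with placeholder paths:

$ python translate/translate_char2char.py -model models/bi-char2char.grads.123000.npz -translate de_en -saveto output.txt -source my_source.tok.txt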
To decode with a multilingual model, append -many (and, of course, provide a path to a multilingual model for -model).
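A multilingual decoding call might then look like this (all paths are placeholders, and whether -translate is still required alongside -many is an assumption; check translate_char2char.py):

$ python translate/translate_char2char.py -model models/multi-char2char.grads.123000.npz -translate de_en -saveto output.txt -which test -many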
We use the multi-bleu.perl script from MOSES to compute the BLEU score. The reference translations can be found in .../wmt15.
perl preprocess/multi-bleu.perl reference.txt < model_output.txt
Clone the Subword-NMT repository.
git clone https://github.com/rsennrich/subword-nmt
Use the following commands (more information can be found in the Subword-NMT repository):
./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
./apply_bpe.py -c {codes_file} < {test_file}
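For example, learning 20,000 merge operations on the source side of your training data and applying them to a test file (the file names and the number of operations are illustrative):

./learn_bpe.py -s 20000 < train.source.txt > bpe_codes.txt
./apply_bpe.py -c bpe_codes.txt < test.source.txt > test.source.bpe.txt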
To convert Cyrillic script to Latin, run:

$ python preprocess/iso.py russian_source.txt

This will produce an output at russian_source.txt.iso9.
@article{Lee:16,
author = {Jason Lee and Kyunghyun Cho and Thomas Hofmann},
title = {Fully Character-Level Neural Machine Translation without Explicit Segmentation},
year = {2016},
journal = {arXiv preprint arXiv:1610.03017},
}