Fully Character-Level Neural Machine Translation

Theano implementation of the models described in the paper Fully Character-Level Neural Machine Translation without Explicit Segmentation.

We present code for training and decoding four different models:

bilingual bpe2char (from Chung et al., 2016).
bilingual char2char
multilingual bpe2char
multilingual char2char

Dependencies

Python

Theano
Numpy
NLTK

GPU

CUDA (we recommend using the latest version; version 8.0 was used in all our experiments)

Related code

For preprocessing and evaluation, we used scripts from MOSES.
This code is based on Subword-NMT and dl4mt-cdec.

Downloading Datasets & Pre-trained Models

The original WMT'15 corpora can be downloaded from here. For the preprocessed corpora used in our experiments, see below.

WMT'15 preprocessed corpora
- Standard version (for bilingual models, 3.5GB)
- Cyrillic converted to Latin (for multilingual models, 2.6GB)

To obtain the pre-trained top-performing models, see below.

Pre-trained models (6.0GB): Tarball updated on Nov 21st 2016. The CS-EN bi-char2char model in the previous tarball was not the best-performing model.

Training Details

Using GPUs

Do the following before executing train*.py.

$ export THEANO_FLAGS=device=gpu,floatX=float32

If your GPU has memory to spare, using cnmem may speed up training:

$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False

On a pre-2016 Titan X GPU with 12GB RAM, our bpe2char models were trained with cnmem. Our char2char models (both bilingual and multilingual) were trained without cnmem (due to lack of RAM).
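
The same flags can also be set from Python, as long as this happens before Theano is imported, since Theano reads THEANO_FLAGS at import time. The following is a minimal sketch; the helper name build_theano_flags is hypothetical and not part of this repository.

```python
import os


def build_theano_flags(use_cnmem=False, cnmem_fraction=0.95):
    """Return a THEANO_FLAGS string matching the settings shown above.

    Hypothetical helper for illustration; not part of this repository.
    """
    flags = ["device=gpu", "floatX=float32"]
    if use_cnmem:
        # Preallocate a fraction of GPU memory and turn off Theano's
        # garbage collection, as in the cnmem example above.
        flags.append("lib.cnmem={}".format(cnmem_fraction))
        flags.append("allow_gc=False")
    return ",".join(flags)


# This assignment must run before `import theano` to have any effect.
os.environ["THEANO_FLAGS"] = build_theano_flags(use_cnmem=True)
```

Passing use_cnmem=False reproduces the first export line, which is the setting described above for the char2char models.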

Training models

Before running the commands below, edit train*.py so that it points to the directory containing the WMT'15 corpora.

Bilingual bpe2char

$ python bpe2char/train_bi_bpe2char.py -translate <LANGUAGE_PAIR>

Bilingual char2char

$ python char2char/train_bi_char2char.py -translate <LANGUAGE_PAIR>

Multilingual bpe2char

$ python bpe2char/train_multi_bpe2char.py

Multilingual char2char

$ python char2char/train_multi_char2char.py

Checkpoint

To resume training a model from a checkpoint, append -re_load and -re_load_old_setting to the corresponding training command above. Make sure the checkpoint resides in the correct directory (.../dl4mt-c2c/models).
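
As a rough illustration of how the training and resume options combine, the sketch below assembles the command lines listed above. The helper build_train_command is hypothetical and only builds an argument list; the script paths and flags are the ones given in this README, and <LANGUAGE_PAIR> is left as a placeholder because its exact format is defined by the training scripts.

```python
# Script paths as listed in the "Training models" section.
TRAIN_SCRIPTS = {
    ("bpe2char", False): "bpe2char/train_bi_bpe2char.py",
    ("char2char", False): "char2char/train_bi_char2char.py",
    ("bpe2char", True): "bpe2char/train_multi_bpe2char.py",
    ("char2char", True): "char2char/train_multi_char2char.py",
}


def build_train_command(model, multilingual=False, language_pair=None, resume=False):
    """Build the argument list for one training run (hypothetical helper)."""
    cmd = ["python", TRAIN_SCRIPTS[(model, multilingual)]]
    if not multilingual:
        # Only the bilingual scripts take -translate in the commands above.
        if language_pair is None:
            raise ValueError("bilingual models need a language pair")
        cmd += ["-translate", language_pair]
    if resume:
        # Flags for resuming from a checkpoint, as described above.
        cmd += ["-re_load", "-re_load_old_setting"]
    return cmd


print(" ".join(build_train_command("char2char", language_pair="<LANGUAGE_PAIR>", resume=True)))
```

The returned list can be passed to subprocess.run if you prefer to launch training from a Python script.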

Using Custom Datasets

To train models on your own dataset (rather than the WMT'15 corpora), you first need to build your vocabulary using build_dictionary_char.py for char2char models or build_dictionary_word.py for bpe2char models. For bpe2char models, you additionally need to learn BPE segmentation rules on the source corpus using the Subword-NMT repository (see "Extracting & applying BPE rules" below).
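
To make the idea of a character vocabulary concrete, here is a minimal sketch that counts characters and assigns integer ids by descending frequency. It is not the implementation in build_dictionary_char.py; the function name and the reserved tokens "eos" and "UNK" are assumptions made for illustration, and the actual script may reserve different entries.

```python
from collections import Counter


def build_char_dictionary(lines, special_tokens=("eos", "UNK")):
    """Map each character to an integer id, most frequent first.

    Illustrative only: the reserved tokens are assumptions, not
    necessarily those used by build_dictionary_char.py.
    """
    counts = Counter()
    for line in lines:
        # Count every character, including spaces, but not the newline.
        counts.update(line.rstrip("\n"))
    # Sort by descending frequency; break ties by character so the
    # result is deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    vocab = {token: idx for idx, token in enumerate(special_tokens)}
    for ch in ordered:
        vocab[ch] = len(vocab)
    return vocab


vocab = build_char_dictionary(["abba\n", "ab\n"])
```

In practice you would pass an open file handle for your training corpus instead of the two toy lines.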

Decoding

Decoding WMT'15 validation / test files

Before running the commands below, edit translate*.py so that it points to the directory containing the WMT'15 corpora.

$ export THEANO_FLAGS=device=gpu,floatX=float32,lib.cnmem=0.95,allow_gc=False
$ python translate/translate_bpe2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for bpe2char models
$ python translate/translate_char2char.py -model <PATH_TO_MODEL.npz> -translate <LANGUAGE_PAIR> -saveto <DESTINATION> -which <VALID/TEST_SET> # for char2char models

When choosing which pre-trained model to pass to -model, pick a file with .grads in its name, e.g. .grads.123000.npz. These are the best-performing models and the ones you should decode from.
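
If a model directory contains many files, a small filter can list the candidates. The sketch below is a hypothetical helper that keeps only file names ending in .grads.<number>.npz, following the example name above; treating the number as a training step count is an assumption, and the file names in the example are made up.

```python
import re

# Matches names such as "....grads.123000.npz" and captures the number.
GRADS_PATTERN = re.compile(r"\.grads\.(\d+)\.npz$")


def list_grads_checkpoints(filenames):
    """Return (number, filename) pairs for .grads checkpoints, sorted by number."""
    found = []
    for name in filenames:
        match = GRADS_PATTERN.search(name)
        if match:
            found.append((int(match.group(1)), name))
    return sorted(found)


# Hypothetical file names, for illustration only.
candidates = list_grads_checkpoints([
    "model.npz",
    "model.grads.123000.npz",
    "model.grads.5000.npz",
    "model.grads.5000.npz.pkl",
])
```

To scan a real directory, pass os.listdir(path) as the argument.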

Decoding an arbitrary file

Remove -which <VALID/TEST_SET> and append -source <PATH_TO_SOURCE>.

If you choose to decode your own source file, make sure it is:

properly tokenized (using preprocess/preprocess.sh).
bpe-tokenized (for bpe2char models).
converted from Cyrillic to Latin (for multilingual models).

Decoding multilingual models

Append -many, and pass the path of a multilingual model to -model.

Evaluation

We use the multi-bleu.perl script from MOSES to compute the BLEU score. The reference translations can be found in .../wmt15.

$ perl preprocess/multi-bleu.perl reference.txt < model_output.txt

Extra

Extracting & applying BPE rules

Clone the Subword-NMT repository.

$ git clone https://github.com/rsennrich/subword-nmt

Use the following commands (see the Subword-NMT repository for more information):

$ ./learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
$ ./apply_bpe.py -c {codes_file} < {test_file}
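
For readers unfamiliar with BPE, the sketch below shows the core idea behind learning merge rules: repeatedly find the most frequent adjacent symbol pair and merge it into one symbol. It is a toy illustration, not the Subword-NMT implementation; the function names, the tie-breaking rule, and the tiny vocabulary are assumptions chosen for the example.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    # Match the pair only at symbol boundaries (whitespace or string edges).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_operations):
    """Return the list of merges learned from a {spaced word: count} vocabulary."""
    merges = []
    for _ in range(num_operations):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Ties are broken by the pair itself; this rule is an arbitrary
        # choice made here so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy vocabulary: words split into characters, with an end-of-word marker.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, final_vocab = learn_bpe(toy_vocab, 3)
```

Here num_operations plays the same role as the -s argument of learn_bpe.py: it bounds how many merge rules are learned.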

Converting Cyrillic to Latin

$ python preprocess/iso.py russian_source.txt

This writes the converted text to russian_source.txt.iso9.
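
The conversion is, in essence, a character-by-character substitution. The sketch below illustrates that idea with a deliberately partial table of lowercase letters; it is not the mapping used by preprocess/iso.py, and the names CYRILLIC_TO_LATIN and transliterate are hypothetical.

```python
# Partial table for illustration only: a handful of lowercase letters.
# The full mapping applied by preprocess/iso.py is not reproduced here.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "з": "z", "и": "i", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f",
}

# str.maketrans turns the dict into a table usable by str.translate.
TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def transliterate(text):
    """Replace mapped Cyrillic letters; leave all other characters unchanged."""
    return text.translate(TABLE)


converted = transliterate("москва")
```

Characters missing from the table, including uppercase letters and punctuation, pass through unchanged, so a complete table is needed for real data.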

Citation

@article{Lee:16,
  author    = {Jason Lee and Kyunghyun Cho and Thomas Hofmann},
  title     = {Fully Character-Level Neural Machine Translation without Explicit Segmentation},
  year      = {2016},
  journal   = {arXiv preprint arXiv:1610.03017},
}

jasonray716/dl4mt-c2c