/nmt

Turkish to English Neural Machine Translation

Primary LanguagePython

Neural Machine Translation

This repository implements a Turkish to English Neural Machine Translation system using Seq2Seq + Global Attention model. There is also a Flask application that you can run locally. You can enter the text, translate and, inspect the results as well as the attention visualization. We run beam search with beam size 3 in the background and return the most probable sequences sorted by their relative score.

Examples

NMT Image

Dataset

The dataset for this project is taken from here. I have used the Tatoeba corpus. I have deleted some of the duplicates found in the data. I also pretokenized the dataset. Finalized version can be found in data folder.

Tokenization

  • For tokenizing the Turkish sentences, I've used the nltk's RegexpTokenizer.
    puncts_except_apostrophe = '!"#$%&\()*+,-./:;<=>?@[\\]^_`{|}~'
    TOKENIZE_PATTERN = fr"[{puncts_except_apostrophe}]|\w+|['\w]+"
    regex_tokenizer = RegexpTokenizer(pattern=TOKENIZE_PATTERN)
    text = "Titanic 15 Nisan pazartesi saat 02:20'de battı."
    tokenized_text = regex_tokenizer.tokenize(text)
    print(" ".join(tokenized_text))
    # Output: Titanic 15 Nisan pazartesi saat 02 : 20 'de battı .
    # This splitting property on "02 : 20" is different from the English tokenizer.
    # We could handle those situations. But I wanted to keep it simple and see if 
    # the attention distribution on those words aligns with the English tokens.
    # There are similar cases mostly on dates as well like in this example: 02/09/2019
  • For tokenizing the English sentences, I've used the spacy's English model.
    en_nlp = spacy.load('en_core_web_sm')
    text = "The Titanic sank at 02:20 on Monday, April 15th."
    tokenized_text = en_nlp.tokenizer(text)
    print(" ".join([tok.text for tok in tokenized_text]))
    # Output: The Titanic sank at 02:20 on Monday , April 15th .

Format

Turkish and English sentences are expected to be in two different files.

file: train.tr
tr_sent_1
tr_sent_2
tr_sent_3
...

file: train.en
en_sent_1
en_sent_2
en_sent_3
...

Train

Please run python train.py -h for the full list of arguments.

Sample usage:

python train.py --train_data train.tr train.en --valid_data valid.tr valid.en --n_epochs 30 --batch_size 32 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 --device cuda

Test

To compute the corpus level blue score.

usage: test.py [-h] --model_file MODEL_FILE --valid_data VALID_DATA
               [VALID_DATA ...]

Neural Machine Translation Testing

optional arguments:
  -h, --help            show this help message and exit
  --model_file MODEL_FILE
                        Model File
  --valid_data VALID_DATA [VALID_DATA ...]
                        Validation_data


Sample Usage:
python test.py --model_file model.bin --validation_data valid.tr valid.en

Flask App

For running the application locally, run:

python app.py

Make sure that your model paths in the config.py file is properly defined.

Improvements

  • Using subword units (for both Turkish and English)
  • Different attention mechanisms (learning different parameters for the attention)

References