DeepTrans is a character level language model for transliterating English text into Hindi. It is based on the attention mechanism presented in [1] and its implementation in Tensorflow's Sequence to Sequence models. This project has been inspired by the translation model presented in tensorflow's sequence to sequence model. This project comes with a pretrained model (2 layers with 256 units each) for Hindi but can be easily trained over the existing model or from scratch. The pretrained models are trained on lowercase words. If you wish to train your own model then feel free to do whatever you want and I would be glad if you could share your results and models with me. I hope to see interesting results.

Prerequisites

Tensorflow (Version >= 0.9)
Python 2.7

I have tested it on an Ubuntu 15.04 with NVIDIA GeForce GT 740M Graphics card with Tensorflow running in a virtual environment. It should ideally run smoothly on any other system with tensorflow installed in it.

Installation and Setup

Clone Repository

git clone https://github.com/dashayushman/deep-trans.git

Run Tests

python transliterate.py --self_test

This will generate a fake model (2 layers 32 units per layer) with fake data and trains it for 5 steps.
If the code returns without any errors, proceed to the next step.

Download The Model and Vocabulary

Download the pre-trained model from here and extract the model files to any folder in your system. The folder structure for models looks something like the following,

trained_model
    |_version_1.0
            |_model_12_09_2016.zip
            |_model_12_09_2016.tar
    |_version_0.1
            |_model_9_08_2016.zip
            |_model_9_08_2016.tar

Download the vocabulary from here and extract the vocabulary files to any folder in your system. The folder structure for vocabulary looks something like the following,

vocabulary
    |_version_1.0
            |_vocab_12_09_2016.zip
            |_vocab_12_09_2016.tar
    |_version_0.1
            |_vocab_9_08_2016.zip
            |_vocab_9_08_2016.tar

The pretrained models and vocabularies are versioned with a date attached to the name of the compressed files. Downloading the latest version is recommended. You will find both .tar and .zip files in the download link. Both of them have the same model so you can download any one. Make sure that your model and vocabulary date and version match.

Load and Run

Loading the model

Execute the following command from your commandline to load the pre-trained models and enter an interactive mode where you can input english strings in the standard input and check results there itself.

python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --decode

Your commandline should have something like this

You can enter your 'English word' after the '>' in the command like and hit enter to see results.

Transliterate a file

Execute the following command from your commandline to load the pre-trained models and transliterate an entire file.
Make sure your file contains one english word per line and is named 'test.en'

python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --transliterate_file --transliterate_file_dir <path_to_directory_that_contains_test.en>

If you get a 'done generating the output file!!!' message on your commandline, then you are good to go. You will find a 'results.txt' file in your 'transliterate_file_dir'

Train Your Own Model

Requirements

Training and development files: You will need two set of files for training your own model.

Training Files: You would need two training files with file names 'train.rel.2.en' and 'train.rel.2.hn'. The 'train.rel.2.en' should contain all the english words for training with one word per line and each character separated by a space. Similarly 'train.rel.2.hn' should contain corresponding hindi words for the english words in 'train.rel.2.en' with one word per line and each character separated by a space. Make sure that the English and Hindi words correspond otherwise you will end up training a very messy model.
Development Files: You would need two development files with file names 'test.rel.2.en' and 'test.rel.2.hn'. The 'test.rel.2.en' should contain english words for validation with one word per line and each character separated by a space. Similarly 'test.rel.2.hn' should contain corresponding hindi words for the english words in 'train.rel.2.en' with one word per line and each character separated by a space. Make sure that the English and Hindi words correspond.

Try not to overlap the development set and training set.
Keep these files in a directory.
Very Important Point To Note: Due to the Character encoding issues in python 2.7 I have to put these restrictions on formatting the data (adding spaces between every character in a word). I will soon release another version with Python3+ support and solve this encoding issue and remove this weird data formatting restriction.
This is how the data files should look like:

Training

Once you have the above files in a directory, execute the following command to start training your own model.

python transliterate.py --data_dir <path_to_directory_with_training_and_development_files> --train_dir <path_to_a_directory_to_save_checkpoints> --size=2<number_units_per_layer> --num_layers=<number_of_layers> --steps_per_checkpoint=<number_of_steps_to_save_a_checkpoint>

The following is a real example of the above,

python transliterate.py --data_dir /home/ayushman/projects/transliterate/train_test_data/ --train_dir /home/ayushman/projects/transliterate/chkpnts/ --size=1024 --num_layers=5 --steps_per_checkpoint=1000

FLAGS

The following is a list of available flags that you can set for changing the model parameters.

FLAG	VALUE TYPE	DEFAULT VALUE	DESCRIPTION
learning_rate	Float	0.001	Learning rate for backpropagation through time.
learning_rate_decay_factor	Float	0.99	Learning rate decays by this much.
max_gradient_norm	Float	5.0	Clip gradients to this norm.
batch_size	Integer	10	Batch size to use during training.
size	Integer	256	Size of each model layer.
num_layers	Integer	2	Number of layers in the model.
en_vocab_size	Integer	40000	English vocabulary size.
hn_vocab_size	Integer	40000	Hindi vocabulary size.
data_dir	String(path)	/tmp	Data directory
transliterate_file_dir	String(path)	/tmp	Data directory
train_dir	String(path)	/tmp	Training directory (to save checkpoints or models).
max_train_data_size	Integer	0	Limit on the size of training data (0: no limit).
steps_per_checkpoint	Integer	200	How many training steps to do per checkpoint.
decode	Boolean	False	et to True for interactive decoding.
transliterate_file	Boolean	False	Set to True for transliterating a file.
self_test	Boolean	False	Run a self-test if this is set to True.

References

Sequence to Sequence Learning with Neural Networks