DeepTrans is a character-level language model for transliterating English text into Hindi. It is based on the attention mechanism presented in [1] and its implementation in TensorFlow's sequence-to-sequence models, and it was inspired by the translation model that accompanies them. The project comes with a pretrained model for Hindi (2 layers with 256 units each); you can continue training from that model or train a new one from scratch. The pretrained models are trained on lowercase words. If you train your own model, feel free to experiment, and I would be glad if you shared your results and models with me. I hope to see interesting results.
- Tensorflow (Version >= 0.9)
- Python 2.7
I have tested it on Ubuntu 15.04 with an NVIDIA GeForce GT 740M graphics card and TensorFlow running in a virtual environment. It should run on any other system that has TensorFlow installed.
git clone https://github.com/dashayushman/deep-trans.git
python transliterate.py --self_test
This generates a fake model (2 layers, 32 units per layer) with fake data and trains it for 5 steps.
If the code returns without any errors, proceed to the next step.
- Download the pre-trained model from here and extract the model files to any folder on your system. The folder structure for models looks like this:
trained_model
  |_version_1.0
    |_model_12_09_2016.zip
    |_model_12_09_2016.tar
  |_version_0.1
    |_model_9_08_2016.zip
    |_model_9_08_2016.tar
- Download the vocabulary from here and extract the vocabulary files to any folder on your system. The folder structure for the vocabulary looks like this:
vocabulary
  |_version_1.0
    |_vocab_12_09_2016.zip
    |_vocab_12_09_2016.tar
  |_version_0.1
    |_vocab_9_08_2016.zip
    |_vocab_9_08_2016.tar
The pretrained models and vocabularies are versioned, and the date is part of each compressed file's name. Downloading the latest version is recommended. The download link offers both .tar and .zip files; they contain the same model, so download either one. Make sure that the date and version of your model match those of your vocabulary.
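As a quick sanity check before loading anything, the date in the model archive name can be compared with the date in the vocabulary archive name. The sketch below is not part of DeepTrans; `archive_date` and `archives_match` are hypothetical helpers, and the pattern assumes the `<name>_<day>_<month>_<year>.zip|tar` naming shown in the folder listings above.

```python
import re

# Assumption: archive names end in "_<day>_<month>_<year>.zip" or ".tar",
# as in "model_12_09_2016.zip" and "vocab_9_08_2016.tar".
_DATE_PATTERN = re.compile(r"_(\d{1,2}_\d{2}_\d{4})\.(?:zip|tar)$")


def archive_date(filename):
    """Return the date part of an archive name, e.g. "12_09_2016"."""
    match = _DATE_PATTERN.search(filename)
    if match is None:
        raise ValueError("no date found in archive name: %r" % filename)
    return match.group(1)


def archives_match(model_archive, vocab_archive):
    """Return True when both archive names carry the same date."""
    return archive_date(model_archive) == archive_date(vocab_archive)


if __name__ == "__main__":
    print(archives_match("model_12_09_2016.zip", "vocab_12_09_2016.tar"))
    print(archives_match("model_12_09_2016.zip", "vocab_9_08_2016.zip"))
```

The check only compares file names; it does not inspect the archive contents or the version folder.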
Run the following command from your command line to load the pre-trained model and enter an interactive mode, where you can type English strings on standard input and see the results immediately.
python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --decode
Your command line should then show a '>' prompt.
Enter your English word after the '>' and press Enter to see the result.
Run the following command from your command line to load the pre-trained model and transliterate an entire file.
Make sure your file contains one English word per line and is named 'test.en'.
python transliterate.py --data_dir <path_to_vocabulary_directory> --train_dir <path_to_models_directory> --transliterate_file --transliterate_file_dir <path_to_directory_that_contains_test.en>
If you see the message 'done generating the output file!!!' on your command line, the run succeeded. You will find a 'results.txt' file in your 'transliterate_file_dir'.
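If your words live in a Python list rather than a file, a small script can produce an input file in the expected shape. This is a minimal sketch, not part of DeepTrans: `write_test_file` is a hypothetical helper, and lowercasing is an assumption that follows from the pretrained models being trained on lowercase words.

```python
import io
import os
import tempfile


def write_test_file(words, directory):
    """Write one lowercase word per line to <directory>/test.en."""
    path = os.path.join(directory, "test.en")
    with io.open(path, "w", encoding="utf-8") as handle:
        for word in words:
            cleaned = word.strip().lower()
            if cleaned:  # skip empty entries so no blank lines are written
                handle.write(cleaned + u"\n")
    return path


if __name__ == "__main__":
    # A temporary directory stands in for your transliterate_file_dir.
    target_dir = tempfile.mkdtemp()
    print(write_test_file(["Delhi", "  mumbai ", ""], target_dir))
```

Pass the directory you wrote to as `--transliterate_file_dir` when running the command above.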
- Training and development files: You will need two sets of files to train your own model.
- Training Files: You need two training files named 'train.rel.2.en' and 'train.rel.2.hn'. 'train.rel.2.en' should contain all the English words for training, one word per line, with each character separated by a space. Similarly, 'train.rel.2.hn' should contain the corresponding Hindi words for the English words in 'train.rel.2.en', one word per line, with each character separated by a space. Make sure that line N of the Hindi file corresponds to line N of the English file; otherwise you will end up training a very messy model.
- Development Files: You need two development files named 'test.rel.2.en' and 'test.rel.2.hn'. 'test.rel.2.en' should contain English words for validation, one word per line, with each character separated by a space. Similarly, 'test.rel.2.hn' should contain the corresponding Hindi words for the English words in 'test.rel.2.en', one word per line, with each character separated by a space. Make sure that the English and Hindi words correspond.
- Try not to overlap the development set and training set.
- Keep these files in a directory.
- Very Important Point To Note: Because of character encoding issues in Python 2.7, I have to impose this formatting restriction on the data (a space between every character in a word). I will soon release another version with Python 3+ support that solves the encoding issue and removes this restriction.
- Each line of a data file holds one word with its characters separated by spaces, for example 'n a m a s t e'.
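The file layout described in the list above can be generated from a list of word pairs. The following is a minimal sketch under stated assumptions, not the author's tooling: `space_separate` and `write_parallel_files` are hypothetical helpers, the script targets Python 3, and "character" is taken to mean a Unicode code point, so Hindi vowel signs end up as separate tokens. Check that this matches your vocabulary before training.

```python
import io
import os
import tempfile


def space_separate(word):
    """Join the Unicode code points of a word with single spaces."""
    return u" ".join(word.strip())


def write_parallel_files(pairs, directory, prefix="train"):
    """Write <prefix>.rel.2.en and <prefix>.rel.2.hn, one word per line.

    `pairs` is a sequence of (english, hindi) tuples; line N of one file
    corresponds to line N of the other.
    """
    en_path = os.path.join(directory, prefix + ".rel.2.en")
    hn_path = os.path.join(directory, prefix + ".rel.2.hn")
    with io.open(en_path, "w", encoding="utf-8") as en_file, \
            io.open(hn_path, "w", encoding="utf-8") as hn_file:
        for english, hindi in pairs:
            en_file.write(space_separate(english) + u"\n")
            hn_file.write(space_separate(hindi) + u"\n")
    return en_path, hn_path


if __name__ == "__main__":
    # A temporary directory stands in for your data directory.
    data_dir = tempfile.mkdtemp()
    pairs = [(u"namaste", u"नमस्ते"), (u"dilli", u"दिल्ली")]
    print(write_parallel_files(pairs, data_dir, prefix="train"))
    print(write_parallel_files(pairs[:1], data_dir, prefix="test"))
```

Use `prefix="train"` for the training files and `prefix="test"` for the development files, with disjoint word lists for each.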
Once you have the above files in a directory, run the following command to start training your own model.
python transliterate.py --data_dir <path_to_directory_with_training_and_development_files> --train_dir <path_to_a_directory_to_save_checkpoints> --size=<number_units_per_layer> --num_layers=<number_of_layers> --steps_per_checkpoint=<number_of_steps_to_save_a_checkpoint>
The following is a real example of the above:
python transliterate.py --data_dir /home/ayushman/projects/transliterate/train_test_data/ --train_dir /home/ayushman/projects/transliterate/chkpnts/ --size=1024 --num_layers=5 --steps_per_checkpoint=1000
The following is a list of available flags that you can set for changing the model parameters.
FLAG | VALUE TYPE | DEFAULT VALUE | DESCRIPTION |
---|---|---|---|
learning_rate | Float | 0.001 | Learning rate for backpropagation through time. |
learning_rate_decay_factor | Float | 0.99 | Learning rate decays by this much. |
max_gradient_norm | Float | 5.0 | Clip gradients to this norm. |
batch_size | Integer | 10 | Batch size to use during training. |
size | Integer | 256 | Size of each model layer. |
num_layers | Integer | 2 | Number of layers in the model. |
en_vocab_size | Integer | 40000 | English vocabulary size. |
hn_vocab_size | Integer | 40000 | Hindi vocabulary size. |
data_dir | String(path) | /tmp | Data directory |
transliterate_file_dir | String(path) | /tmp | Directory containing the file to transliterate ('test.en'); 'results.txt' is written here. |
train_dir | String(path) | /tmp | Training directory (to save checkpoints or models). |
max_train_data_size | Integer | 0 | Limit on the size of training data (0: no limit). |
steps_per_checkpoint | Integer | 200 | How many training steps to do per checkpoint. |
decode | Boolean | False | Set to True for interactive decoding. |
transliterate_file | Boolean | False | Set to True for transliterating a file. |
self_test | Boolean | False | Run a self-test if this is set to True. |
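
To make the `max_gradient_norm` and `learning_rate_decay_factor` descriptions concrete, here is the arithmetic they suggest, written in plain Python. This is an illustrative sketch only: it assumes clipping by global L2 norm and multiplicative decay, it does not show how or when `transliterate.py` applies either, and both function names are hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    `gradients` is a list of lists of floats (one inner list per variable).
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= max_norm:
        # Already within the limit: return copies unchanged.
        return [list(grad) for grad in gradients], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


def decay_learning_rate(learning_rate, decay_factor):
    """Multiply the learning rate by the decay factor once."""
    return learning_rate * decay_factor


if __name__ == "__main__":
    clipped, norm = clip_by_global_norm([[6.0], [8.0]], max_norm=5.0)
    print(norm, clipped)
    print(decay_learning_rate(0.001, 0.99))
```

With the defaults from the table, a gradient whose global norm exceeds 5.0 would be scaled down to norm 5.0, and each decay would shrink the learning rate by 1%.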