This is the repository for the final project in the HSE Deep Learning in NLP Course. The goal of the project is to develop a machine translation system that comes close in quality to an existing system. The project focused mainly on translating from Russian to Ukrainian; however, it is language-agnostic and, in theory, should work with any language pair. This particular pair was chosen mainly due to the availability of a reference BLEU score obtained by a system with a modern architecture, my ability to manually gauge the quality of the translations, and the possibility of gaining some linguistic insight into the intricacies of neural machine translation.
A report (in Russian) is available here.
Note: to reproduce the results locally, you'll need a Unix-like system and a CUDA-compatible GPU. Alternatively, you can use Google Colaboratory (as in the example provided here).
In any case, you'll first need to run the following commands:
```
git clone https://github.com/slowwavesleep/NeuralMachineTranslation.git
cd NeuralMachineTranslation
```
Then install the dependencies:
```
pip install -r requirements.txt
```
Or you can use `colab_helper.sh` to install the dependencies and download the training data:
```
bash colab_helper.sh
```
To train the main model of this project, run:
```
python train.py main_config.yml
```
The YAML config file contains all the parameters necessary to train the model, as well as the paths specifying where to read the data from and where to save the results and trained models.
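For orientation, here is a minimal sketch of what such a config might look like. Every key and value below is hypothetical and chosen purely for illustration; see `main_config.yml` for the actual parameters and paths.

```yaml
# Hypothetical sketch of a training config; the real keys live in main_config.yml.
data:
  train_source: data/train.src   # source-language training sentences
  train_target: data/train.tgt   # target-language training sentences
  dev_source: data/dev.src
  dev_target: data/dev.tgt
training:
  batch_size: 64
  epochs: 10
  learning_rate: 0.0003
paths:
  model: models/main.pt                         # where to save the trained model
  translations: results/main/translations.txt  # where to write translations
```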
To evaluate the output of a model, run the following command:
```
python test.py results/main/translations.txt
```
The only argument specifies the location of the file to evaluate.
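For intuition, a corpus-level BLEU computation can be sketched with the sacrebleu package as below. This is only an illustration of the metric, not the actual `test.py`, and the reference file path is hypothetical.

```python
# Minimal corpus-level BLEU sketch with sacrebleu (pip install sacrebleu).
# Illustration only; test.py may work differently, and the reference
# file path below is hypothetical.
import sacrebleu

with open("results/main/translations.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

with open("data/test.tgt", encoding="utf-8") as f:  # hypothetical references
    references = [line.strip() for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```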
The translations produced by the models in this project can be found here.
BLEU score evaluations are presented below.
| Model | Num. of examples | BLEU |
|---|---|---|
| Baseline | 100000 | 2.1 |
| Main | 100000 | 32.94 |
| Main | 800000 | 49.55 |
Data for training and evaluation is taken from the Tatoeba Challenge, specifically the rus-ukr pair. Model results for this data are available here.
The models may well be used with other language pairs from other resources, and there are a few tools in this repository that can help with that.
If you have data in tab-delimited format, you can use `prepare_anki.py` to convert it into two separate files, assuming that the first and second elements in each line correspond to the source and target sentences, respectively.
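For illustration, the kind of conversion `prepare_anki.py` performs can be sketched as follows. The file names are hypothetical, and the script's actual interface may differ; consult the script itself for its real arguments.

```python
# Sketch of a tab-delimited -> two parallel files conversion.
# Hypothetical file names; prepare_anki.py's actual interface may differ.
with open("data.tsv", encoding="utf-8") as tsv, \
        open("sentences.src", "w", encoding="utf-8") as src, \
        open("sentences.tgt", "w", encoding="utf-8") as tgt:
    for line in tsv:
        # First element is the source sentence, second is the target.
        source, target = line.rstrip("\n").split("\t")[:2]
        src.write(source + "\n")
        tgt.write(target + "\n")
```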
If you have aligned sentences stored in two files, you can use `split_data.py` to conveniently split your data into train, dev, and test parts, which can then be used to train a new model, provided you specify the correct file paths in the YAML configuration file.
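A minimal sketch of such a split is shown below, again with hypothetical file names and an assumed 80/10/10 ratio; see `split_data.py` for its actual behavior.

```python
# Sketch of a train/dev/test split over two aligned files.
# Hypothetical paths and an assumed 80/10/10 ratio; split_data.py
# may use different arguments and proportions.
import random

with open("sentences.src", encoding="utf-8") as f:
    source = f.readlines()
with open("sentences.tgt", encoding="utf-8") as f:
    target = f.readlines()

pairs = list(zip(source, target))
random.seed(42)  # make the split reproducible
random.shuffle(pairs)

n = len(pairs)
parts = {
    "train": pairs[: int(0.8 * n)],
    "dev": pairs[int(0.8 * n) : int(0.9 * n)],
    "test": pairs[int(0.9 * n) :],
}

for name, part in parts.items():
    with open(f"{name}.src", "w", encoding="utf-8") as src, \
            open(f"{name}.tgt", "w", encoding="utf-8") as tgt:
        for s, t in part:
            src.write(s)
            tgt.write(t)
```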