
Honorifics (T/V) control for English-to-Russian translation.



This project aims to implement the honorific (T/V) distinction when translating from English to Russian.

It is inspired by the article 'Controlling Politeness in Neural Machine Translation via Side Constraints' (Sennrich et al., 2016).

This README gives a short overview of the project; for a detailed description, please read this report.


The repo consists of:

  • token-based (code/tv_detector/) and grammar-based (code/conll_tv_detector/) T/V detectors

  • neural translation, evaluation (BLEURT) and morphosyntactic parsing (DeepPavlov) models as Jupyter notebooks (under code/notebooks/). The notebooks are intended to be run in Google Colaboratory.

  • data processing utilities (code/helper.py) and some examples (code/main.py)

  • train and test corpora (under data/)

  • predicted translations (under translations/)
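To illustrate the idea behind a token-based T/V detector, here is a minimal sketch. It is not the code in code/tv_detector/: the word lists, the `detect_tv` name and the "N" label for neutral sentences are assumptions made for this example. It also ignores that "вы" can be a plain plural, which a purely token-based check cannot tell apart from the polite singular and which is one motivation for a grammar-based detector.

```python
import re

# Hypothetical, deliberately incomplete pronoun lists (assumptions for this sketch).
T_FORMS = {"ты", "тебя", "тебе", "тобой", "твой", "твоя", "твоё", "твои"}
V_FORMS = {"вы", "вас", "вам", "вами", "ваш", "ваша", "ваше", "ваши"}

# Matches runs of Cyrillic letters; IGNORECASE also covers capitalised forms like "Вы".
TOKEN_RE = re.compile(r"[а-яё]+", re.IGNORECASE)


def detect_tv(sentence: str) -> str:
    """Return "T", "V", or "N" (neutral or ambiguous) for a Russian sentence."""
    tokens = {token.lower() for token in TOKEN_RE.findall(sentence)}
    has_t = bool(tokens & T_FORMS)
    has_v = bool(tokens & V_FORMS)
    if has_t and not has_v:
        return "T"
    if has_v and not has_t:
        return "V"
    # No second-person form found, or both registers are mixed.
    return "N"
```

For example, `detect_tv("Ты где?")` returns "T", while a sentence without any second-person pronoun falls into the neutral class.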

Data Sources

Training data for the neural model is taken from the Yandex 1M EN-RU corpus. The corpus was sampled to select 22k V-sentences, 8k T-sentences and 100k neutral sentences.
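The sampling step can be sketched as follows, assuming every sentence pair already carries a "T", "V" or "N" (neutral) label. The function name `sample_by_label`, the label spellings and the fixed seed are assumptions for this example and may differ from what code/helper.py actually does.

```python
import random
from collections import defaultdict


def sample_by_label(pairs, labels, quotas, seed=0):
    """Draw up to quotas[label] sentence pairs per label, without replacement.

    pairs  -- list of (english, russian) tuples
    labels -- list of labels, aligned with pairs
    quotas -- dict mapping a label to the maximum number of pairs to keep
    """
    buckets = defaultdict(list)
    for pair, label in zip(pairs, labels):
        buckets[label].append(pair)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = {}
    for label, quota in quotas.items():
        items = buckets.get(label, [])
        # Never ask for more items than the bucket holds.
        sampled[label] = rng.sample(items, min(quota, len(items)))
    return sampled
```

With the proportions quoted above, the call would use `quotas={"V": 22_000, "T": 8_000, "N": 100_000}`.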

The test dataset was built from sources manually annotated for the deixis problem (Voita et al., 2019).


The model was developed with JoeyNMT as the base translation framework. The main notebooks for model training and demonstration are code/notebooks/train_TV_model.ipynb and code/notebooks/demo_TV_model.ipynb. Trained checkpoints and some data files for the demo are available on Google Drive (TV_model, base_model).


T/V control examples:

Evaluation results:

In summary, a simple technique such as prepending T/V tokens to source sentences can add honorific controllability to NMT.
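The prepending step itself is small. The sketch below shows one way it could look; the token spellings `<T>` and `<V>` and the function name `add_side_constraint` are assumptions for illustration and may not match the ones used in the training notebook.

```python
# Hypothetical side-constraint tokens; neutral sentences get no token.
TAGS = {"T": "<T>", "V": "<V>"}


def add_side_constraint(source: str, label: str) -> str:
    """Prepend a T/V control token to an English source sentence."""
    tag = TAGS.get(label)
    if tag is None:
        # Unknown or neutral label: leave the sentence unchanged.
        return source
    return f"{tag} {source}"
```

At training time the label would come from the Russian target side (for example from a T/V detector); at inference time the user picks it, so `add_side_constraint("How are you?", "V")` yields "<V> How are you?".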

Author: Tsimafei Prakapenka