Project for honorifics (T/V) control in English-to-Russian translation.

TV-distinction

This project aims to implement control over the honorific (T/V) distinction when translating from English to Russian.

It is inspired by the paper 'Controlling Politeness in Neural Machine Translation via Side Constraints' (Sennrich et al., 2016).

This README provides a short overview of the project. For a detailed description, please read this report.

Repository

The repo consists of

  • token-based (code/tv_detector/) and grammar-based (code/conll_tv_detector/) T/V detectors

  • neural translation, evaluation (BLEURT), and morphosyntactic parsing (DeepPavlov) models in Jupyter Notebook format (under code/notebooks/). These notebooks are intended to be run in Google Colaboratory.

  • data processing utilities (code/helper.py) and some examples (code/main.py)

  • train and test corpora (under data/)

  • predicted translations (under translations/)
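
To illustrate the idea behind a token-based T/V detector, here is a minimal sketch. It is not the code from code/tv_detector/; the word lists, the function name `detect_tv`, and the returned labels are assumptions made for this example. The lists are deliberately incomplete, and a pure token lookup cannot tell polite "вы" from plural "вы", nor catch sentences where the register is carried only by a verb ending.

```python
import re
from typing import Optional

# Illustrative, incomplete lists of Russian second-person forms (assumption:
# the repository's detector may use different or larger lists).
T_FORMS = {"ты", "тебя", "тебе", "тобой", "твой", "твоя", "твоё", "твои"}
V_FORMS = {"вы", "вас", "вам", "вами", "ваш", "ваша", "ваше", "ваши"}

# Matches runs of lowercase Cyrillic letters; input is lowercased first.
TOKEN_RE = re.compile(r"[а-яё]+")


def detect_tv(sentence: str) -> Optional[str]:
    """Return "T", "V", or None (neutral or ambiguous) for a Russian sentence."""
    tokens = set(TOKEN_RE.findall(sentence.lower()))
    has_t = bool(tokens & T_FORMS)
    has_v = bool(tokens & V_FORMS)
    if has_t and not has_v:
        return "T"
    if has_v and not has_t:
        return "V"
    # No second-person form found, or both registers present.
    return None
```

For example, `detect_tv("Как ты себя чувствуешь?")` yields `"T"`, while a sentence without any listed form yields `None`.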

Data Sources

Training data for the neural model is taken from the Yandex 1M EN-RU corpus. The corpus was sampled to select 22k V-sentences, 8k T-sentences, and 100k neutral sentences.
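
The sampling step could look roughly like the sketch below. The README does not describe how the sampling was actually done, so everything here is an assumption: the label names ("V", "T", "N"), the function `sample_by_label`, and the use of uniform random sampling with a fixed seed. Only the target sizes come from the text above.

```python
import random
from typing import Dict, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (English source, Russian target)

# Target sizes stated in this README; "N" for neutral is a hypothetical label.
TARGET_SIZES = {"V": 22_000, "T": 8_000, "N": 100_000}


def sample_by_label(
    labelled_pairs: Iterable[Tuple[str, SentencePair]],
    sizes: Dict[str, int],
    seed: int = 0,
) -> Dict[str, List[SentencePair]]:
    """Group sentence pairs by label and draw a fixed-size sample per label."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    buckets: Dict[str, List[SentencePair]] = {label: [] for label in sizes}
    for label, pair in labelled_pairs:
        if label in buckets:  # pairs with unknown labels are skipped
            buckets[label].append(pair)
    # Never request more items than a bucket actually holds.
    return {
        label: rng.sample(items, min(sizes[label], len(items)))
        for label, items in buckets.items()
    }
```

Calling `sample_by_label(labelled_pairs, TARGET_SIZES)` on a labelled corpus would return one list of sentence pairs per label, each capped at the requested size.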

The test dataset was built from sources that were manually annotated for the deixis problem (Voita et al., 2019).

Model

The model was developed with JoeyNMT as the base translation framework. The main notebooks for model training and demonstration are code/notebooks/train_TV_model.ipynb and code/notebooks/demo_TV_model.ipynb. Trained checkpoints and some data files for the demo are available on Google Drive (TV_model, base_model).

Results

T/V control examples:

Evaluation results:

In summary, a simple technique such as prepending T/V tokens to source sentences can add controllability to NMT.
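
The prepending step itself is small enough to sketch in a few lines. The tag strings `<T>` and `<V>` and the function name `add_side_constraint` are assumptions for illustration; the notebooks in this repository may use different tokens. Neutral sentences are left untagged here, which is also an assumption.

```python
from typing import Optional

# Hypothetical control tokens; the actual model may use different strings.
TAGS = {"T": "<T>", "V": "<V>"}


def add_side_constraint(source: str, register: Optional[str] = None) -> str:
    """Prepend a T/V control token to an English source sentence.

    With register=None the sentence is returned unchanged (neutral case).
    """
    if register is None:
        return source
    return f"{TAGS[register]} {source}"
```

At training time the tag would be derived from the Russian reference translation, and at inference time the user picks it, so `add_side_constraint("How are you?", "V")` produces the source string `<V> How are you?` to request the polite form.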


Author: Tsimafei Prakapenka