Sumerian - English Machine Translation (GSoC - 2020)
As part of the MTAAC project, the organization hosts Sumerian data comprising roughly 1.5 million transliteration lines and a parallel corpus of about 10K lines. We have already developed a neural encoder-decoder architecture for English-Sumerian machine translation, but it uses only the parallel dataset, which is not sufficient to achieve state-of-the-art results. Your task is to develop a language model that uses the monolingual data as well as the parallel data to translate Sumerian phrases to English, and vice versa.
Possible Mentors:
- Niko Schenk
- Ravneet Punia
Link for the Dataset: TODO
Your Tasks & Desired (Minimum) Outcomes:
- Train and evaluate different models and architectures on standard train/development/test splits.
- Experiment with a broad range of hyperparameter settings to obtain the best performance.
- Perform a quantitative and qualitative evaluation of the translations.
- Achieve better accuracy than the previous year's model.
- Test at least two different NMT approaches, such as Cross-lingual Language Model, Dual Learning, or Back-Translation.
Students with a research background will be preferred.
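For comparing models on the same train/development/test splits, one option is to shuffle the parallel lines once with a fixed seed and reuse the resulting partition for every experiment. The following is only a minimal sketch: the function name `split_parallel_corpus`, the 80/10/10 proportions, and the seed are assumptions chosen for illustration, not part of the project setup.

```python
import random


def split_parallel_corpus(pairs, dev_frac=0.1, test_frac=0.1, seed=13):
    """Cut (sumerian, english) pairs into train/dev/test partitions.

    The shuffle uses a fixed seed, so repeated calls on the same input
    give the same partition. Fractions and seed are illustrative defaults.
    """
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")

    shuffled = list(pairs)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, does not touch global state

    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)

    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test
```

Keeping the seed fixed means every architecture and hyperparameter setting is scored on the same held-out lines, which makes the comparison with earlier results meaningful.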
Getting started:
- Cross-lingual Language Model
- Dual Learning
- Back-Translation for Unsupervised NMT
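As a rough illustration of how back-translation can draw on the monolingual Sumerian lines: a reverse model translates each monolingual target-side line into the source language, and the resulting synthetic pairs are added to the real parallel data. This is a sketch under stated assumptions, not the project's implementation. The names `back_translate` and `augment` are hypothetical, and `target_to_source` stands for any trained reverse model exposed as a plain function from string to string.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(monolingual_target: Iterable[str],
                   target_to_source: Callable[[str], str]) -> List[Pair]:
    """Build synthetic (source, target) pairs from target-side monolingual text.

    `target_to_source` is a hypothetical callable wrapping a reverse
    translation model; the target side of each pair stays human-written.
    """
    synthetic = []
    for target in monolingual_target:
        source = target_to_source(target).strip()
        if source:  # drop lines for which the reverse model returned nothing
            synthetic.append((source, target))
    return synthetic


def augment(parallel: Iterable[Pair],
            monolingual_target: Iterable[str],
            target_to_source: Callable[[str], str]) -> List[Pair]:
    """Return the real parallel pairs followed by the synthetic ones."""
    return list(parallel) + back_translate(monolingual_target, target_to_source)
```

For an English-to-Sumerian model, `monolingual_target` would be the Sumerian transliteration lines and `target_to_source` a Sumerian-to-English model trained on the parallel corpus. In practice the synthetic pairs are often tagged or down-weighted, since their source side is machine-generated.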