Although speakers of Indian languages such as Hindi and Bengali make up a large share of today’s population, these languages are considered low-resource: only the IITB Hi-En corpus has more than 1 million parallel aligned sentences. In the PIB corpus, the largest publicly available multilingual training corpus for Indian languages (as of March 2021), most of the other language pairs do not even reach one lakh (100,000) parallel segments. So little data is not enough for data-hungry NMT models. We therefore aimed to fill this gap and improve results for Indic machine translation by following the steps of the IITB corpus collection, surveying the publicly available datasets, and creating the Boli corpus.
The website for the corpus is hosted here. The scripts for the creation of the corpus can be found here.
By Kaivalya and Vedant, supervised by Prof. Parag Singla.