cEnTam-Dataset

This repository contains a parallel corpus for the English-Tamil language pair. The data is a collection of sentences taken from textbooks, bilingual novels, storybooks, and bilingual websites covering the tourism, health, and news domains. The dataset is provided as an .xml file in Unicode format.

Bilingual sentences - 56495

Monolingual

English sentences - 457396

Tamil sentences - 563568
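
Since the corpus ships as a Unicode .xml file, the bilingual sentences can be read with a standard XML parser. The sketch below is only an illustration: the element names `pair`, `en`, and `ta` are assumptions (this README does not document the actual schema), and the sample sentences are made up for the demo rather than taken from the dataset. Inspect the real file and pass the matching tag names.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample in an ASSUMED layout; the real cEnTam schema may differ.
SAMPLE = """<corpus>
  <pair><en>Hello</en><ta>வணக்கம்</ta></pair>
  <pair><en>Thank you</en><ta>நன்றி</ta></pair>
  <pair><en>Missing translation</en></pair>
</corpus>""".encode("utf-8")


def read_pairs(source, pair_tag="pair", en_tag="en", ta_tag="ta"):
    """Yield (english, tamil) tuples from an XML file path or binary file object.

    The tag names are hypothetical defaults; adjust them to the actual file.
    """
    root = ET.parse(source).getroot()
    for pair in root.iter(pair_tag):
        en = pair.findtext(en_tag)
        ta = pair.findtext(ta_tag)
        # Skip entries where either side is missing or empty.
        if en and ta and en.strip() and ta.strip():
            yield en.strip(), ta.strip()


# Usage: replace io.BytesIO(SAMPLE) with the path to the downloaded .xml file.
for en, ta in read_pairs(io.BytesIO(SAMPLE)):
    print(en, "|", ta)
```

Parsing from bytes (or a file path) lets the XML parser handle the encoding itself, which avoids decoding problems with Tamil text.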

Please cite the following paper if you use this dataset.

@inproceedings{jp-etal-2020-centam,
    title = "c{E}n{T}am: Creation and Validation of a New {E}nglish-{T}amil Bilingual Corpus",
    author = "JP, Sanjanasri and B, Premjith and Menon, Vijay Krishna and KP, Soman",
    booktitle = "Proceedings of the 13th Workshop on Building and Using Comparable Corpora",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.bucc-1.10",
    pages = "61--64",
    language = "English",
}

If you need the monolingual dataset, please email us at sanjanashree@gmail.com or vijaykrishnamenon@gmail.com.
