The IndoWordnet Parallel Corpus

IndoWordnet is a linked structure of wordnets of major Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan families. Synsets are linked across many languages, and every synset in every language contains a gloss and an example usage sentence or phrase. In a large number of cases, the glosses and examples of linked synsets are translations of each other. Hence, IndoWordnet is a source of parallel corpora across multiple Indian languages.

The corpus contains about 6.3 million parallel segments across 18 Indian languages from 3 language families.
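
To make the idea concrete, here is a minimal sketch of how linked synsets can yield parallel segments. It is an illustration only, not the extraction code used to build this corpus: the `SYNSETS` layout, the placeholder strings and the function name `extract_parallel_segments` are all assumptions. It also pairs every gloss and example blindly, although in practice not every such pair is a true translation.

```python
from itertools import combinations

# Hypothetical toy data: synset ID -> language code -> (gloss, example).
# This layout is an assumption made for illustration; it is not the
# actual IndoWordnet storage format.
SYNSETS = {
    101: {
        "hi": ("hi gloss 101", "hi example 101"),
        "mr": ("mr gloss 101", "mr example 101"),
        "ta": ("ta gloss 101", "ta example 101"),
    },
    102: {
        "hi": ("hi gloss 102", "hi example 102"),
        "ta": ("ta gloss 102", "ta example 102"),
    },
}


def extract_parallel_segments(synsets):
    """Pair the gloss and example of each synset across every language pair.

    Returns a dict mapping (src_lang, tgt_lang) to a list of
    (src_segment, tgt_segment) tuples.
    """
    pairs = {}
    for synset_id in sorted(synsets):
        entries = synsets[synset_id]
        # Every unordered pair of languages that share this synset.
        for src, tgt in combinations(sorted(entries), 2):
            bucket = pairs.setdefault((src, tgt), [])
            # Align gloss with gloss and example with example.
            for src_seg, tgt_seg in zip(entries[src], entries[tgt]):
                bucket.append((src_seg, tgt_seg))
    return pairs


if __name__ == "__main__":
    for (src, tgt), segments in extract_parallel_segments(SYNSETS).items():
        print(src, tgt, len(segments))
```

Because each synset shared by two languages contributes up to two segment pairs (one gloss, one example), the number of segments for a language pair grows with the number of synsets the two wordnets have in common.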

NEWS! WMT 2020 is using this corpus for the shared task on similar language translation.

Documentation

You can read more about the corpus in this document: pdf

Download the corpus

You can download the corpus HERE

Version History

v0.2 (14 May 2020): Bug fixes to address problems with extraction in v0.1.
v0.1 (25 March 2020): Initial release (BUGGY: don't use this version, use v0.2)

License

This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Citing this dataset

If you use this dataset, please include the following citation:

@misc{kunchukuttan2020iwnparallel,
author = "Anoop Kunchukuttan",
title = "IndoWordnet Parallel Corpus",
year = "2020",
howpublished={\url{https://github.com/anoopkunchukuttan/indowordnet_parallel}}
}

We would like to hear from you if:

You are using our resources. Please let us know how you are putting these resources to use.
You have any feedback on these resources.