OPUS, the open parallel corpus
A Dravidian Etymological Dictionary
Byte Pair Encoding - Pretrained for 275 languages
FastText word vectors for 157 languages
Indian Language Technology Proliferation and Deployment Center
Center For Indian Language Technology - CFILT FB page
Indian Institute of Language Studies (IILS)
Central Institute of Indian Languages
Survey: Natural Language Parsing for Indian Languages
mlmorph - Malayalam Morphological Analyzer using Finite State Transducer
Open Tamil - Suite of tools for operating on Tamil text.
Text Classification model in PyTorch: can be easily applied to other datasets; in fact, the linked repository also contains a dataset of film reviews in Tamil.
- Contains Wikipedia Articles Dataset (72,374 articles) and scripts which were used to scrape Wikipedia and clean that dataset
- Contains Language Model with Perplexity ~41
- Contains Bengali News Classification Model with 94% accuracy
Research Papers in Bengali NLP
Language | Repository | Perplexity of language model | Wikipedia articles dataset | Classification accuracy (%) | Classification Kappa score |
---|---|---|---|---|---|
Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) |
Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) |
Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) |
Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) |
Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) |
Malayalam | NLP for Malyalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) |
Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) |
Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) |
Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) |
Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) |
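
For readers unfamiliar with the two less common metrics in the table, here is a minimal sketch of how perplexity and Cohen's kappa are usually defined. It is not taken from the linked repositories. The function names `perplexity` and `cohen_kappa` are hypothetical and the inputs are toy data. The table's Kappa values look like they are scaled by 100, though the source does not say so; the sketch returns kappa on the usual -1 to 1 scale.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels, corrected for chance agreement."""
    n = len(y_true)
    # Observed agreement: fraction of items where prediction matches the label.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if labels and predictions were independent.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Toy example: a model that assigns probability 0.25 to every token.
toy_log_probs = [math.log(0.25)] * 8
toy_perplexity = perplexity(toy_log_probs)

# Toy example: four items, one misclassified.
toy_kappa = cohen_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

A lower perplexity means the language model is less "surprised" by held-out text. Kappa is a useful companion to accuracy on imbalanced datasets, because a classifier that always predicts the majority class can score high accuracy while its kappa stays near zero.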