inltk: A Python repository from dashayushman

Natural Language Toolkit for Indic Languages (iNLTK)

iNLTK aims to provide out-of-the-box support for the NLP tasks that an application developer might need for Indic languages.

Documentation

Check out the detailed docs, along with installation instructions, at https://inltk.readthedocs.io

Supported languages

Language	Code
Hindi	hi
Punjabi	pa
Sanskrit	sa
Gujarati	gu
Kannada	kn
Malayalam	ml
Nepali	ne
Odia	or
Marathi	mr
Bengali	bn
Tamil	ta
Urdu	ur
English	en
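
The codes in the table above are short identifiers for each supported language. As a minimal sketch, assuming you want to validate a code in your own application before using it, the table can be mirrored in a lookup like the one below. The names `SUPPORTED_LANGUAGES` and `language_name` are hypothetical helpers for illustration and are not part of iNLTK; see the docs for the library's actual API.

```python
# Language codes copied from the "Supported languages" table above.
# This mapping is an illustrative helper, not part of the iNLTK package.
SUPPORTED_LANGUAGES = {
    "hi": "Hindi",
    "pa": "Punjabi",
    "sa": "Sanskrit",
    "gu": "Gujarati",
    "kn": "Kannada",
    "ml": "Malayalam",
    "ne": "Nepali",
    "or": "Odia",
    "mr": "Marathi",
    "bn": "Bengali",
    "ta": "Tamil",
    "ur": "Urdu",
    "en": "English",
}


def language_name(code: str) -> str:
    """Return the language name for a code, or raise ValueError if unsupported.

    Hypothetical helper: it only checks the code against the table above.
    """
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return SUPPORTED_LANGUAGES[normalized]
```

For example, `language_name("hi")` returns `"Hindi"`, while a code that is not in the table, such as `"te"`, raises `ValueError`.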

Repositories containing models used in iNLTK

Language	Repository	Dataset used for Language modeling	Perplexity of ULMFiT LM	Perplexity of TransformerXL LM	Dataset used for Classification	Classification Accuracy	Classification Kappa score	ULMFiT Embeddings visualization	TransformerXL Embeddings visualization
Hindi	NLP for Hindi	Hindi Wikipedia Articles - 172k / Hindi Wikipedia Articles - 55k	34.06 / 35.87	26.09 / 34.78	Hindi Movie Reviews Dataset / BBC Hindi News Dataset	61.66 / 79.79	42.29 / 73.01	Hindi Embeddings projection	Hindi Embeddings projection
Punjabi	NLP for Punjabi	Punjabi Wikipedia Articles	24.40	14.03	Punjabi News Dataset	89.17	54.5	Punjabi Embeddings projection	Punjabi Embeddings projection
Sanskrit	NLP for Sanskrit	Sanskrit Wikipedia Articles	~6	~3	Sanskrit Shlokas Dataset	84.3	76.1	Sanskrit Embeddings projection	Sanskrit Embeddings projection
Gujarati	NLP for Gujarati	Gujarati Wikipedia Articles	34.12	28.12	Gujarati News Dataset	92.4	87.9	Gujarati Embeddings projection	Gujarati Embeddings projection
Kannada	NLP for Kannada	Kannada Wikipedia Articles	70.10	61.97	Kannada News Dataset	95.9	93.04	Kannada Embeddings projection	Kannada Embeddings projection
Malayalam	NLP for Malayalam	Malayalam Wikipedia Articles	26.39	25.79	Malayalam News Dataset	94.36	91.54	Malayalam Embeddings projection	Malayalam Embeddings projection
Nepali	NLP for Nepali	Nepali Wikipedia Articles	31.5	29.3	Nepali News Dataset	98.5	97.7	Nepali Embeddings projection	Nepali Embeddings projection
Odia	NLP for Odia	Odia Wikipedia Articles	26.57	26.81	Odia News Dataset	95.52	93.02	Odia Embeddings Projection	Odia Embeddings Projection
Marathi	NLP for Marathi	Marathi Wikipedia Articles	18	17.42	Marathi News Dataset	93.55	87.50	Marathi Embeddings projection	Marathi Embeddings projection
Bengali	NLP for Bengali	Bengali Wikipedia Articles	41.2	39.3	Bengali News Dataset	93.8	92	Bengali Embeddings projection	Bengali Embeddings projection
Tamil	NLP for Tamil	Tamil Wikipedia Articles	19.80	17.22	Tamil News Dataset	96.78	95.09	Tamil Embeddings projection	Tamil Embeddings projection
Urdu	NLP for Urdu	Urdu Wikipedia Articles	13.19	12.55	Urdu News Dataset	95.28	91.58	Urdu Embeddings projection	Urdu Embeddings projection

Note: The English model has been taken directly from fast.ai.
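
The table above reports perplexity for the language models and accuracy and a kappa score for the classifiers, but it does not say how they were computed. The sketch below shows the textbook definitions only, assuming that perplexity is the exponential of the mean negative log-likelihood per token and that the kappa column is Cohen's kappa (which the table would then be reporting scaled by 100). The function names are illustrative and are not taken from iNLTK or the model repositories, which remain the authority on the actual evaluation code.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model assigned
    to each token of the evaluation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Fraction of examples where the prediction matches the label.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the marginal class frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Only one class on both sides: kappa is undefined (0/0).
        # This sketch returns 1.0 in that case.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Under these definitions, a model that assigns probability 0.25 to every token has a perplexity of about 4, and a classifier whose predictions are no better than chance has a kappa of 0 even if its raw accuracy is well above zero.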

Contributing

Add a new language support

If you would like to add support for a language of your choice to iNLTK, please start by checking for an existing issue, or raising a new one, here

To begin with, please check out the steps I mentioned here for Telugu. They should be nearly the same for other languages.

Improving models/using models for your own research

If you would like to take iNLTK's models and refine them with your own dataset, or build your own custom models on top of them, please check out the repository for your language in the table above. Each repository contains links to the datasets, pretrained models, and classifiers, along with all of the code used to build them.

Add new functionality

If you would like a particular piece of functionality in iNLTK, start by checking for an existing issue, or raising a new one, here

What's next

..and being worked upon

Shout out if you want to help :)

Add Telugu and Maithili support

..and NOT being worked upon

Shout out if you want to lead :)

Add NER support for all languages
Add Textual Entailment support for all languages
Work on a unified model for all the languages
POS support in iNLTK
Add translations - to and from languages in iNLTK + English

iNLTK's Appreciation

By Jeremy Howard on Twitter
By Sebastian Ruder on Twitter
By Vincent Boucher on LinkedIn
By Kanimozhi, By Soham, By Imaad on LinkedIn
iNLTK was trending on GitHub in May 2019
