This repository contains state-of-the-art language models and classifiers for the Hindi language (spoken in the Indian subcontinent).
The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).
Architecture/Dataset | Hindi Wikipedia Articles - 172k | Hindi Wikipedia Articles - 55k |
---|---|---|
ULMFiT | 34.06 | 35.87 |
TransformerXL | 26.09 | 34.78 |
Note: Nirant's earlier state-of-the-art work on a Hindi language model achieved a perplexity of ~46. The scores above are not directly comparable with his, because his train and test sets were different and his test set isn't available for reproduction.
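For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities: the exponential of the mean negative log-likelihood. The function name `perplexity` and the toy inputs are illustrative assumptions and are not taken from this repository's training code.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check (hypothetical data): a model that gives every token
# probability 1/4 is as uncertain as a uniform choice among 4 tokens.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))  # approximately 4

# Going the other way: a reported perplexity of 34.06 corresponds to a
# mean cross-entropy of ln(34.06), roughly 3.53 nats per token.
print(math.log(34.06))
```

Lower is better: a perplexity of 26.09 means the model is, on average, about as uncertain as a uniform choice among 26 tokens.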
Dataset | Accuracy | Kappa Score |
---|---|---|
Hindi Movie Reviews Dataset | 62.22 | 43.13 |
BBC Hindi Dataset | 79.79 | 73.01 |
Hindi Movie Reviews Dataset (with augmented data) | 68.33 | 52.25 |
Check out this blog post, which discusses the effect of data augmentation on the classification metrics for the Hindi Movie Reviews Dataset.
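The two columns in the classification table can be illustrated with a short sketch. It assumes the Kappa Score column refers to Cohen's kappa (agreement corrected for chance) and that both columns are reported as percentages; neither assumption is confirmed by this README. The label names and predictions below are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the label marginals.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
print(accuracy(y_true, y_pred) * 100)     # 4 of 6 correct
print(cohen_kappa(y_true, y_pred) * 100)  # chance agreement here is 1/3
```

Kappa is lower than accuracy whenever some of the agreement could have occurred by chance, which is why the table reports both.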
Architecture | Visualization |
---|---|
ULMFiT | Embeddings projection |
TransformerXL | Embeddings projection |
Architecture | Visualization |
---|---|
ULMFiT | Encodings projection |
Download the pretrained ULMFiT and TransformerXL language models, trained on Hindi Wikipedia Articles - 172k and Hindi Wikipedia Articles - 55k, from here.
Download the Movie Review classifier from here.
Download the BBC News classifier from here.
Unsupervised training using Google's SentencePiece.
Download the trained model and vocabulary from here.
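As a rough illustration of unsupervised tokenizer training, the sketch below uses the `sentencepiece` Python package (a third-party dependency, installed with `pip install sentencepiece`). The corpus file name, model prefix, vocabulary size, and model type are all assumptions chosen for the example; this README does not state the settings actually used for the released model.

```python
from pathlib import Path


def build_trainer_args(corpus_path, model_prefix="hindi_sp", vocab_size=8000):
    """Collect SentencePiece trainer options (all values here are assumed)."""
    return {
        "input": str(Path(corpus_path)),  # plain text, one sentence per line
        "model_prefix": model_prefix,     # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "unigram",
        "character_coverage": 1.0,
    }


def train_tokenizer(corpus_path, **overrides):
    """Train a SentencePiece model on raw text and return a loaded processor."""
    import sentencepiece as spm  # imported lazily; third-party package

    args = build_trainer_args(corpus_path, **overrides)
    spm.SentencePieceTrainer.train(**args)
    return spm.SentencePieceProcessor(model_file=args["model_prefix"] + ".model")


# Example usage (requires a local corpus file; "hi_wiki.txt" is hypothetical):
# sp = train_tokenizer("hi_wiki.txt")
# pieces = sp.encode("यह एक वाक्य है", out_type=str)
```

No labels are needed: SentencePiece learns its subword vocabulary directly from raw text, which is what makes the training unsupervised.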