This repository contains state-of-the-art language models and a classifier
for Gujarati, a language native to the Indian state of Gujarat.
The models trained here are used in the Natural Language Toolkit for Indic Languages
(iNLTK).
Created as part of this project
- Gujarati Wikipedia Articles
- Gujarati News Dataset
- iNLTK Headlines Corpus - Gujarati: uses the Gujarati News Dataset prepared above.
Language Model Perplexity (on validation set)

| Architecture/Dataset | Gujarati Wikipedia Articles |
| --- | --- |
| ULMFiT | 34.12 |
| TransformerXL | 28.12 |

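
For reference, perplexity is conventionally the exponential of the mean per-token cross-entropy. The sketch below shows that relationship, assuming the validation loss is measured in nats (natural log). The function names are illustrative and are not taken from the training notebooks.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse mapping: the mean cross-entropy (in nats) implied by a perplexity."""
    return math.log(perplexity)


# Validation perplexities reported for the Gujarati Wikipedia Articles dataset.
reported = [("ULMFiT", 34.12), ("TransformerXL", 28.12)]

for name, ppl in reported:
    loss = loss_from_perplexity(ppl)
    print(f"{name}: perplexity {ppl} corresponds to {loss:.3f} nats per token")
```

Lower perplexity means the model assigns higher probability to the held-out text, so by this measure TransformerXL fits the validation set better than ULMFiT.
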

| Dataset | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- |
| iNLTK Headlines Corpus - Gujarati | 91.05 | 86.09 | Link |

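
As a reminder of what the two metrics measure, here is a minimal sketch of accuracy and the multiclass Matthews correlation coefficient (MCC) in plain Python. It is not the code from the notebooks. The label names are hypothetical, and the reported values appear to be scaled to percentages, which is an assumption.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from label counts (the R_K generalisation)."""
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each label truly occurs
    pred_counts = Counter(y_pred)   # how often each label is predicted
    labels = set(true_counts) | set(pred_counts)

    numerator = correct * total - sum(true_counts[k] * pred_counts[k] for k in labels)
    spread_true = total * total - sum(v * v for v in true_counts.values())
    spread_pred = total * total - sum(v * v for v in pred_counts.values())
    denominator = math.sqrt(spread_true * spread_pred)
    # Return 0.0 when either side has a single class, where MCC is undefined.
    return numerator / denominator if denominator else 0.0


# Hypothetical labels, for illustration only.
y_true = ["business", "business", "tech", "tech", "entertainment", "entertainment"]
y_pred = ["business", "tech", "tech", "tech", "entertainment", "entertainment"]

print(f"accuracy: {100 * accuracy(y_true, y_pred):.2f}")
print(f"MCC:      {100 * multiclass_mcc(y_true, y_pred):.2f}")
```

MCC accounts for every cell of the confusion matrix, so it is less flattering than accuracy when classes are imbalanced. That is one reason to report both.
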
Results of using Transfer Learning + Data Augmentation from iNLTK
On using complete training set (with Transfer learning)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| iNLTK Headlines Corpus - Gujarati | (5269, 659, 659) | 91.05 | 86.09 | Link |

On using 10% of training set (with Transfer learning)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| iNLTK Headlines Corpus - Gujarati | (526, 659, 659) | 80.88 | 70.18 | Link |

On using 10% of training set (with Transfer learning + Data Augmentation)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| iNLTK Headlines Corpus - Gujarati | (526, 659, 659) | 81.03 | 70.44 | Link |

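
The reduced training size (526 of 5269 examples) is consistent with truncating 5269 × 0.1. The sketch below shows one way to draw such a subset reproducibly. The sampling procedure actually used in the notebooks is not stated here, so the seeded uniform sample and the `subsample` helper are assumptions.

```python
import random


def subsample(examples, fraction, seed=42):
    """Return a reproducible random subset holding `fraction` of the examples.

    The subset size is truncated towards zero, so 10% of 5269 gives 526.
    """
    size = int(len(examples) * fraction)
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    return rng.sample(examples, size)


# Stand-in for the 5269 training examples; real rows would be (headline, label) pairs.
train = list(range(5269))
small_train = subsample(train, 0.10)

print(len(small_train))
```

The validation and test splits stay at 659 examples each in all three settings, so the reported scores remain comparable across training-set sizes.
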
Download pretrained Language Models from here
The tokenizer was trained using Google's SentencePiece.
Download the trained model and vocabulary from here
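
Once downloaded, the tokenizer model can be loaded with the `sentencepiece` Python package. The following is a minimal sketch. The file name `gujarati_tokenizer.model` is a placeholder, not the actual name of the released file, and the helper functions are illustrative.

```python
from pathlib import Path


def load_tokenizer(model_path):
    """Load a SentencePiece model from disk, failing early if the file is missing."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"SentencePiece model not found: {path}")
    # Imported lazily so the path check above works even without the package installed.
    import sentencepiece as spm
    return spm.SentencePieceProcessor(model_file=str(path))


def tokenize(processor, text):
    """Split text into subword pieces using the loaded model."""
    return processor.encode(text, out_type=str)


if __name__ == "__main__":
    try:
        # Placeholder file name: replace with the path of the downloaded model.
        tokenizer = load_tokenizer("gujarati_tokenizer.model")
        print(tokenize(tokenizer, "ગુજરાતી ભાષા"))
    except FileNotFoundError as error:
        print(error)
```
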