First-ever, state-of-the-art tokenizer, language model and classifier for the Punjabi language (spoken in the Indian subcontinent)
NOTE: To the best of my knowledge, this is the first-ever language model and classifier for Punjabi. If you know of any previous NLP work for Punjabi, let me know. I'll be happy to correct my statements.
Download the Wikipedia articles dataset (44,000 articles), which I scraped, cleaned and trained the model on, from here
Check out the BBC Punjabi News dataset, which I scraped, cleaned and trained the model on, at the repository path datasets-preparation/panjabi-bbc-news-dataset/
Perplexity of the language model: ~13 (on a 20% validation set)
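For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to per-token log-probabilities. It is not the evaluation code used for this model; the function name `perplexity` and the toy input are illustrative assumptions.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative natural-log likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input (assumption): a model that assigns probability 1/13 to every token.
uniform_13 = [math.log(1 / 13)] * 100
print(perplexity(uniform_13))  # close to 13, up to floating-point error
```

Read this way, a perplexity of ~13 means the model is, on average, about as uncertain as if it were choosing uniformly among 13 tokens at each step.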
Kappa score of the classification model: ~60
Accuracy of the classification model: 89%
Note: Accuracy is a misleading metric for the above dataset, as it is highly imbalanced, with
114 positive examples
670 negative examples
Hence, it is better to look at the Kappa score (~60).
The classification results above were obtained on a validation set with ~84% negatives and ~16% positives.
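To make the point about accuracy concrete, here is a small sketch that computes accuracy and Cohen's kappa from a binary confusion matrix. It assumes the reported Kappa score is Cohen's kappa expressed on a 0-100 scale, which is not stated above. The confusion matrix is hypothetical (a classifier that always predicts "negative") and is not the model's actual output; only the 114/670 class counts come from the dataset.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement with the labels beyond what chance would give."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total
    # Chance agreement from the marginal rates of true and predicted labels.
    chance_pos = ((tp + fn) / total) * ((tp + fp) / total)
    chance_neg = ((tn + fp) / total) * ((tn + fn) / total)
    expected = chance_pos + chance_neg
    if expected == 1.0:
        return 0.0  # degenerate case: only one class in both labels and predictions
    return (observed - expected) / (1.0 - expected)


# Hypothetical baseline: predict "negative" for all 114 positives and 670 negatives.
baseline = dict(tp=0, fp=0, fn=114, tn=670)
print(f"accuracy: {accuracy(**baseline):.3f}")
print(f"kappa:    {cohen_kappa(**baseline):.3f}")
```

This do-nothing baseline already reaches roughly 85% accuracy while its kappa is 0, which is why the Kappa score is the more informative number here.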
Download the pretrained language model from here
Download the classifier from here
The tokenizer was trained unsupervised using Google's sentencepiece
Download the trained model and vocabulary from here
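As a rough sketch of what unsupervised training with sentencepiece can look like, the snippet below uses the Python wrapper's `SentencePieceTrainer.train` and `SentencePieceProcessor`. The corpus path, model prefix, vocabulary size, model type and character coverage are all assumptions for illustration, not the settings used for the released model.

```python
def build_trainer_args(corpus_path, model_prefix, vocab_size=8000):
    """Keyword arguments for SentencePieceTrainer.train (all values are assumptions)."""
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # hypothetical; the released vocabulary may differ
        "model_type": "unigram",       # sentencepiece's default algorithm
        "character_coverage": 1.0,     # keep every character of the script
    }


def train_tokenizer(corpus_path, model_prefix, vocab_size=8000):
    """Train on raw text (no labels needed) and load the resulting model."""
    import sentencepiece as spm  # imported lazily; requires `pip install sentencepiece`

    spm.SentencePieceTrainer.train(
        **build_trainer_args(corpus_path, model_prefix, vocab_size)
    )
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")


if __name__ == "__main__":
    # Hypothetical file names; point these at your own corpus.
    print(build_trainer_args("panjabi_wiki.txt", "panjabi_sp"))
```

With a corpus in place, `train_tokenizer("panjabi_wiki.txt", "panjabi_sp").encode(text, out_type=str)` would return the subword pieces for `text`.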