sbnltk-bangla-nltk

Bangla NLP toolkit. Bangla NER, POStag, Stemmer, Word embedding, sentence embedding, summarization, preprocessor, sentiment analysis, etc.

Primary language: Python. License: MIT.

For the smoothest setup, run sbnltk on Google Colab. For the transformer models, install simpletransformers first, or use bn_nlp for the static models. The dataset and training details are uploaded on my GitHub. There is a known problem in the sentiment analyzer; I will fix it soon.

SBNLTK

SUST-Bangla Natural Language Toolkit. A Python module for Bangla NLP tasks.
Demo version: 2.0.2
Requires Python 3.6+. Use a virtual environment to avoid unnecessary issues.

INSTALLATION

PYPI INSTALLATION

pip3 install sbnltk
pip3 install simpletransformers
pip3 install fasttext
pip3 install scikit-learn

MANUAL INSTALLATION FROM GITHUB

  • Clone this project
  • Install all the requirements
  • Run setup.py from the terminal

What will you get here?

  • Bangla Text Preprocessor
  • Bangla dust, punctuation, and stop word removal
  • Bangla word sorting according to the Bangla or English alphabet
  • Bangla word normalization
  • Bangla word stemmer
  • Bangla sentiment analysis (Logistic Regression, LinearSVC, Multinomial Naive Bayes, Random Forest)
  • Bangla sentiment analysis with BERT
  • Bangla sentence POS tagger (static, sklearn)
  • Bangla sentence POS tagger with BERT (Multilingual Cased, Multilingual Uncased)
  • Bangla sentence NER (static, sklearn)
  • Bangla sentence NER with BERT (BERT-Cased, Multilingual Cased/Uncased)
  • Bangla word2vec word embeddings (Gensim, GloVe, fastText)
  • Bangla sentence embedding (contextual, Transformer/BERT)
  • Bangla document summarization (feature based, contextual, semantic based)
  • Bangla bilingual project (Bangla-to-English Google translator without IP blocking)
  • Bangla document information Extraction

SEE THE CODE DOCS FOR USAGE!
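To give a rough idea of what the punctuation and stop word removal features do, here is a minimal standalone sketch in plain Python. It does not use the sbnltk API (see the code docs for that). The function names and the tiny stop word sample are assumptions made up for illustration, not the toolkit's own lists or methods.

```python
import string

# Hypothetical, tiny stop word sample for illustration only (not sbnltk's list).
STOP_WORDS = {"এবং", "ও", "কিন্তু", "এই"}

# ASCII punctuation plus the Bangla danda and double danda.
PUNCTUATION = string.punctuation + "।॥"
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def remove_punctuation(text):
    """Replace every punctuation character with a space."""
    return text.translate(_TO_SPACE)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Strip punctuation, split on whitespace, then drop stop words."""
    return remove_stop_words(remove_punctuation(text).split())


if __name__ == "__main__":
    print(preprocess("আমি ভাত খাই, এবং সে গান করে।"))
```

A real preprocessor would also need Unicode normalization and a much larger stop word list; this sketch only shows the basic pipeline shape.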

TASKS, MODELS, ACCURACY, DATASET AND DOCS

TASK | MODEL | ACCURACY | DATASET | ABOUT | CODE DOCS
--- | --- | --- | --- | --- | ---
Preprocessor | Punctuation, stop word, dust removal; word normalization; others | - | - | | docs
Word tokenizers | Basic tokenizers; customized tokenizers | - | - | | docs
Sentence tokenizers | Basic tokenizers; customized tokenizers; sentence cluster | - | - | | docs
Stemmer | StemmerOP | 85.5% | - | | docs
Sentiment Analysis | Logistic Regression | 88.5% | 20,000+ | | docs
 | LinearSVC | 82.3% | 20,000+ | | docs
 | Multinomial Naive Bayes | 84.1% | 20,000+ | | docs
 | Random Forest | 86.9% | 20,000+ | | docs
 | BERT | 93.2% | 20,000+ | | docs
POS tagger | Static method | 55.5% | 1,40,973 words | | docs
 | SK-LEARN classification | 81.2% | 6,000+ sentences | | docs
 | BERT-Multilingual-Cased | 69.2% | 6,000+ | | docs
 | BERT-Multilingual-Uncased | 78.7% | 6,000+ | | docs
NER tagger | Static method | 65.3% | 4,08,837 entities | | docs
 | SK-LEARN classification | 81.2% | 65,000+ | | docs
 | BERT-Cased | 79.2% | 65,000+ | | docs
 | BERT-Multilingual-Cased | 75.5% | 65,000+ | | docs
 | BERT-Multilingual-Uncased | 90.5% | 65,000+ | | docs
Word Embedding | Gensim-word2vec-100D, 1,00,00,000+ tokens | - | 2,00,00,000+ sentences | | docs
 | Glove-word2vec-100D, 2,30,000+ tokens | - | 5,00,000 sentences | | docs
 | fastText-word2vec-200D, 3,00,000+ | - | 5,00,000 sentences | | docs
Sentence Embedding | Contextual sentence embedding | - | - | | docs
 | Transformer embedding_hd | - | 3,00,000+ human data | | docs
 | Transformer embedding_gd | - | 3,00,000+ google data | | docs
Extractive Summarization | Feature based | 70.0% (F1 score) | - | | docs
 | Transformer sentence sentiment based | 67.0% | - | | docs
 | Word2vec sentence contextual based | 60.0% | - | | docs
Bi-lingual projects | Google translator with large data detector | - | - | | docs
Information Extraction | Static word features | - | | | docs
 | Semantic and contextual | - | | | docs
Bangla Coreference Resolution | | - | | |

Next releases after testing this demo

TASK | VERSION
--- | ---
Coreference Resolution | v1.1
Language translation | v1.1
Masked Language model | v1.1
Information retrieval projects | v1.1
Entity Segmentation | v1.3
Factoid Question Answering | v1.2
Question Classification | v1.2
Sentiment word embedding | v1.3
Many other features | -

Package Installation

If you get a module error, install these packages manually:

  • simpletransformers
  • fasttext

Models

Everything is automated here. When you call a model for the first time, it is downloaded automatically.
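The download-on-first-use behavior described above follows a common caching pattern. The sketch below illustrates that general pattern only; it is not sbnltk's actual code, and the function name `ensure_model`, its parameters, and the cache layout are assumptions made for illustration.

```python
from pathlib import Path
import urllib.request


def ensure_model(name, url, cache_dir, fetch=urllib.request.urlretrieve):
    """Return the local path of a model file, downloading it only if it is missing.

    `fetch` is any callable taking (url, destination_path); it defaults to
    urllib.request.urlretrieve and can be swapped out for testing.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # First call: the file is not cached yet, so fetch it.
        fetch(url, str(target))
    # Later calls skip the download and reuse the cached file.
    return target
```

With this pattern, the first call pays the download cost and every later call returns immediately, which matches the behavior you will see when loading a model for the first time.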

With GPU or Without GPU

  • With a GPU, you can run any model without warnings.
  • Without a GPU, you will get some warnings, but they do not affect the results.
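To find out which case applies to your machine, you can ask PyTorch (which the transformer models rely on through simpletransformers) whether a CUDA device is visible. This is a minimal sketch; the helper name `gpu_available` is hypothetical, and it simply reports False when PyTorch is not installed.

```python
def gpu_available():
    """Return True if PyTorch is installed and can see a CUDA-capable GPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no way to use a GPU here.
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    if gpu_available():
        print("GPU detected: models should run without warnings.")
    else:
        print("No GPU detected: expect some warnings; results are unaffected.")
```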

Motivation

With approximately 228 million native speakers and another 37 million second-language speakers, Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Yet it is still a low-resource language. Why?

Dataset

For all sbnltk datasets and other existing datasets, see Bangla NLP Dataset.

Trainer

For training, see the Colab Trainer. In the future I will add a Trainer module.

When will full version come?

Very soon. We are working on a paper and on improving our modules. The full version will be released in stages.

About accuracy

Accuracy can vary across datasets. We evaluated our models on random but small-scale datasets, because the human resources for this project are limited.
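To see why small random evaluation sets make reported accuracy fluctuate, the sketch below computes plain accuracy (correct predictions divided by total) on several random subsamples of a labeled set. It is an illustration under stated assumptions, not the evaluation script used for the figures in this README; the function names and the seed parameter are hypothetical.

```python
import random


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def sampled_accuracies(gold, predicted, sample_size, runs, seed=0):
    """Accuracy on `runs` random subsamples, each of `sample_size` items."""
    rng = random.Random(seed)
    indices = range(len(gold))
    scores = []
    for _ in range(runs):
        picked = rng.sample(indices, sample_size)
        scores.append(
            accuracy([gold[i] for i in picked], [predicted[i] for i in picked])
        )
    return scores
```

The smaller `sample_size` is, the wider the spread of the returned scores tends to be, which is why a single accuracy number from a small test set should be read as an estimate.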

Contribute Here

  • If you find a problem, please open an issue or contact me.