
A simple, Keras-powered multilingual NLP framework that lets you build models in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS) and text classification tasks. It includes BERT and word2vec embeddings.

Kashgari is:

  • Human-friendly. Kashgari's code is straightforward, well documented and tested, which makes it easy to understand and modify.
  • Powerful and simple. Kashgari allows you to apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS) and classification.
  • Keras based. Kashgari builds directly on Keras, making it easy to train your models and experiment with new approaches using different embeddings and model structures.
  • Easy to fine-tune. Kashgari has built-in pre-trained BERT and word2vec embedding models, which makes it simple to fine-tune your model on top of these embeddings.
  • Fully scalable. Kashgari provides a simple and scalable environment for fast experimentation.

Feature List

  • Embedding support
    • Classic word2vec embedding
    • BERT embedding
  • Sequence (Text) Classification Models
    • CNNModel
    • BLSTMModel
    • CNNLSTMModel
    • AVCNNModel
    • KMaxCNNModel
    • RCNNModel
    • AVRNNModel
    • DropoutBGRUModel
    • DropoutAVRNNModel
  • Sequence (Text) Labeling Models (NER, PoS)
    • CNNLSTMModel
    • BLSTMModel
    • BLSTMCRFModel
  • Model Training
  • Model Evaluation
  • GPU Support
  • Custom Models


Task                      Language  Dataset                    Score       Detail
Named Entity Recognition  Chinese   People's Daily NER Corpus  92.20 (F1)  BERT-based Chinese named entity recognition


  • ELMo Embedding
  • Pre-trained models
  • More model structures


Quick start

Requirements and Installation

The project is based on Keras 2.2.0+ and Python 3.6+, because it is 2019 and type hints are cool.

pip install kashgari
# install one TensorFlow build: CPU only
pip install tensorflow==1.12.0
# or with GPU support
pip install tensorflow-gpu==1.12.0
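
Before installing, it can help to confirm that the interpreter and any existing Keras install meet the minimums stated above. The following is a minimal sketch and not part of Kashgari: `parse_version` and `meets_minimum` are hypothetical helper names introduced here for illustration.

```python
import sys

# Minimum versions taken from the requirements stated in this README.
MIN_PYTHON = (3, 6)
MIN_KERAS = (2, 2, 0)


def parse_version(text):
    """Turn a string such as '2.2.4' into (2, 2, 4).

    Only the leading digits of each dot-separated piece are used, so a
    suffix such as 'rc1' is ignored. Hypothetical helper, for illustration.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_text, minimum):
    """Return True if version_text is at least `minimum` (a tuple of ints)."""
    parsed = parse_version(version_text)
    # Pad with zeros so that '2.2' compares equal to (2, 2, 0).
    padded = parsed + (0,) * (len(minimum) - len(parsed))
    return padded >= minimum


print("Python 3.6+:", sys.version_info[:2] >= MIN_PYTHON)
print("Keras 2.2.4 is new enough:", meets_minimum("2.2.4", MIN_KERAS))
```

If Keras is already installed, passing `keras.__version__` to `meets_minimum` performs the same check against the installed copy.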

Example Usage

Let's run text classification with a CNN-LSTM model on the SMP 2017 ECDT Task1 corpus.

>>> from kashgari.corpus import SMP2017ECDTClassificationCorpus
>>> from kashgari.tasks.classification import CNNLSTMModel

>>> x_data, y_data = SMP2017ECDTClassificationCorpus.get_classification_data()
>>> x_data[0]
['你', '知', '道', '我', '几', '岁']
>>> y_data[0]

# available classification models include `CNNModel`, `BLSTMModel`, `CNNLSTMModel`
>>> classifier = CNNLSTMModel()
>>> classifier.fit(x_data, y_data)
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 10)                0         
embedding_1 (Embedding)      (None, 10, 100)           87500     
conv1d_1 (Conv1D)            (None, 10, 32)            9632      
max_pooling1d_1 (MaxPooling1 (None, 5, 32)             0         
lstm_1 (LSTM)                (None, 100)               53200     
dense_1 (Dense)              (None, 32)                3232      
Total params: 153,564
Trainable params: 153,564
Non-trainable params: 0
Epoch 1/5
 1/35 [..............................] - ETA: 32s - loss: 3.4652 - acc: 0.0469
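
The parameter counts in the summary above can be reproduced with the standard formulas for each layer type. This is a sketch under assumptions: the vocabulary size (875) and the Conv1D kernel size (3) are not stated in the source; they are inferred here because they are the values that reproduce the printed counts.

```python
def embedding_params(vocab_size, dim):
    # One dense vector of length `dim` per vocabulary entry.
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights plus one bias.
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Full weight matrix plus one bias per output unit.
    return input_dim * units + units


counts = {
    "embedding_1": embedding_params(875, 100),  # vocab size 875 is inferred
    "conv1d_1": conv1d_params(3, 100, 32),      # kernel size 3 is inferred
    "lstm_1": lstm_params(32, 100),
    "dense_1": dense_params(100, 32),
}
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name:<12} {value:>8,}")
print(f"{'total':<12} {total:>8,}")
```

The pooling and input layers contribute no parameters, which is why the four layers above account for the full total.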


>>> x_test, y_test = SMP2017ECDTClassificationCorpus.get_classification_data('test')
>>> classifier.evaluate(x_test, y_test)
              precision    recall  f1-score   support
        calc       0.75      0.75      0.75         8
        chat       0.83      0.86      0.85       154
    contacts       0.54      0.70      0.61        10
    cookbook       0.97      0.94      0.95        89
    datetime       0.67      0.67      0.67         6
       email       1.00      0.88      0.93         8
         epg       0.61      0.56      0.58        36
      flight       1.00      0.90      0.95        21
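
The f1-score column in the report above is the harmonic mean of precision and recall. As a minimal sketch (the `f1_score` helper below is written for illustration and is not Kashgari's API), a few rows can be recomputed by hand:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the report above.
rows = {
    "calc": (0.75, 0.75),
    "contacts": (0.54, 0.70),
    "flight": (1.00, 0.90),
}

for label, (precision, recall) in rows.items():
    print(f"{label:>10}  f1 = {f1_score(precision, recall):.2f}")
```

Because the report prints precision and recall rounded to two decimals, recomputing from those rounded values can differ from the printed f1-score by 0.01 on some rows, for example `email`.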

Run with Bert Embedding

from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

bert_embedding = BERTEmbedding('bert-base-chinese', sequence_length=30)                                   
model = CNNLSTMModel(bert_embedding)

train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)
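
A fixed `sequence_length` generally means every input ends up with the same number of tokens: longer inputs are cut and shorter ones are padded. The sketch below only illustrates that idea in plain Python. It is not Kashgari's implementation, the `pad_or_truncate` helper and the `<PAD>` token are hypothetical, and a real BERT pipeline also adds its own special tokens.

```python
def pad_or_truncate(tokens, sequence_length, pad_token="<PAD>"):
    """Return a copy of `tokens` with exactly `sequence_length` items.

    Hypothetical helper for illustration; the padding token is an assumption.
    """
    clipped = tokens[:sequence_length]
    return clipped + [pad_token] * (sequence_length - len(clipped))


sample = ['你', '知', '道', '我', '几', '岁']
fixed = pad_or_truncate(sample, 30)

print(len(fixed))   # number of tokens after padding
print(fixed[:8])    # original tokens followed by padding
```

Choosing a length close to the typical sentence length in the corpus keeps padding small while truncating few inputs.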

Run with Word2vec Embedding

from kashgari.embeddings import WordEmbeddings
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

# word2vec vectors, so the variable is named accordingly
word_embedding = WordEmbeddings('sgns.weibo.bigram', sequence_length=30)
model = CNNLSTMModel(word_embedding)
train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)


Thanks for your interest in contributing! There are many ways to get involved; start with the contributor guidelines and then check these open issues for specific tasks.


This library is inspired by and references the following frameworks and papers.


