Kashgari

A simple and powerful NLP framework: build your state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS) and text classification tasks.

Kashgari is:

  • Human-friendly. Kashgari's code is straightforward, well documented and tested, which makes it easy to understand and modify.
  • Powerful and simple. Kashgari lets you apply state-of-the-art natural language processing (NLP) models to your text for tasks such as named entity recognition (NER), part-of-speech tagging (PoS) and classification.
  • Keras based. Kashgari builds directly on Keras, making it easy to train your models and experiment with new approaches using different embeddings and model structures.
  • Easy to fine-tune. Kashgari has built-in pre-trained BERT and Word2vec embedding models, which makes it simple to fine-tune your model on top of these embeddings.
  • Fully scalable. Kashgari provides a simple, fast, and scalable environment for rapid experimentation.

Feature List

  • Embedding support
    • Classic word2vec embedding
    • BERT embedding
    • GPT-2 embedding
  • Sequence(Text) Classification Models
    • CNNModel
    • BLSTMModel
    • CNNLSTMModel
    • AVCNNModel
    • KMaxCNNModel
    • RCNNModel
    • AVRNNModel
    • DropoutBGRUModel
    • DropoutAVRNNModel
  • Sequence(Text) Labeling Models (NER, PoS)
    • CNNLSTMModel
    • BLSTMModel
    • BLSTMCRFModel
  • Model Training
  • Model Evaluation
  • GPU Support / Multi GPU Support
  • Customize Model (see the sketch after this list)
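
For the Customize Model feature, the rough sketch below shows the general shape of a custom classifier built on top of an embedding. The base-class hooks shown here (`ClassificationModel`, `build_model`, `self.embedding.model`, `self.label2idx`) are assumptions about the library's internals, not a documented API, so treat this as a starting point and verify against the source:

from keras.layers import Dense, GlobalAveragePooling1D
from keras.models import Model

from kashgari.tasks.classification import ClassificationModel  # base class name is an assumption


class MyOwnModel(ClassificationModel):
    # Hypothetical custom classifier; hook and attribute names are assumptions.
    def build_model(self):
        # Assume the embedding exposes a Keras model whose output is a
        # (batch, sequence_length, embedding_dim) tensor.
        base_model = self.embedding.model
        pooled = GlobalAveragePooling1D()(base_model.output)
        output = Dense(len(self.label2idx), activation='softmax')(pooled)
        self.model = Model(base_model.inputs, output)
        self.model.compile(loss='categorical_crossentropy',
                           optimizer='adam',
                           metrics=['accuracy'])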

Performance

Task                      Language  Dataset                    Score       Detail
Named Entity Recognition  Chinese   People's Daily NER Corpus  92.20 (F1)  Chinese NER based on BERT (基于 BERT 的中文命名实体识别)

Roadmap

Tutorials

Here is a set of quick tutorials to get you started with the library:

There are also articles and posts that illustrate how to use Kashgari:

Quick start

Requirements and Installation

The project is based on Keras 2.2.0+ and Python 3.6+, because it is 2019 and type hints are cool.

pip install kashgari
# CPU
pip install tensorflow==1.12.0
# GPU
pip install tensorflow-gpu==1.12.0
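
After installing, a quick sanity check confirms which TensorFlow build Kashgari will run on (assuming the package exposes the usual `__version__` attribute):

import tensorflow as tf
import kashgari

print(tf.__version__)        # expect 1.12.0 per the install step above
print(kashgari.__version__)  # assumes kashgari defines __version__, as most releases do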

Example Usage

Let's run a text classification task with a CNN model on SMP 2017 ECDT Task 1.

>>> from kashgari.corpus import SMP2017ECDTClassificationCorpus
>>> from kashgari.tasks.classification import CNNLSTMModel

>>> x_data, y_data = SMP2017ECDTClassificationCorpus.get_classification_data()
>>> x_data[0]
['你', '知', '道', '我', '几', '岁']
>>> y_data[0]
'chat'

# Kashgari provides several classification models: `CNNModel`, `BLSTMModel`, `CNNLSTMModel`
>>> classifier = CNNLSTMModel()
>>> classifier.fit(x_data, y_data)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 10, 100)           87500     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 10, 32)            9632      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 5, 32)             0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 32)                3232      
=================================================================
Total params: 153,564
Trainable params: 153,564
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
 1/35 [..............................] - ETA: 32s - loss: 3.4652 - acc: 0.0469

... 

>>> x_test, y_test = SMP2017ECDTClassificationCorpus.get_classification_data('test')
>>> classifier.evaluate(x_test, y_test)
              precision    recall  f1-score   support
         
        calc       0.75      0.75      0.75         8
        chat       0.83      0.86      0.85       154
    contacts       0.54      0.70      0.61        10
    cookbook       0.97      0.94      0.95        89
    datetime       0.67      0.67      0.67         6
       email       1.00      0.88      0.93         8
         epg       0.61      0.56      0.58        36
      flight       1.00      0.90      0.95        21
...
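
Once trained, the classifier can label new sentences with `predict`. A minimal sketch; the input uses the same character-tokenized format as the training data, and the return format may differ between versions:

# predict the label of a tokenized sentence (same format as x_data)
>>> classifier.predict(['你', '知', '道', '我', '几', '岁'])
'chat'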

Run with GPT-2 Embedding

from kashgari.embeddings import GPT2Embedding
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

gpt2_embedding = GPT2Embedding('<path-to-gpt-model-folder>', sequence_length=30)                                 
model = CNNLSTMModel(gpt2_embedding)

train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)

Run with BERT Embedding

from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

bert_embedding = BERTEmbedding('bert-base-chinese', sequence_length=30)                                   
model = CNNLSTMModel(bert_embedding)

train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)

Run with Word2vec Embedding

from kashgari.embeddings import WordEmbeddings
from kashgari.tasks.classification import CNNLSTMModel
from kashgari.corpus import SMP2017ECDTClassificationCorpus

word2vec_embedding = WordEmbeddings('sgns.weibo.bigram', sequence_length=30)
model = CNNLSTMModel(word2vec_embedding)
train_x, train_y = SMP2017ECDTClassificationCorpus.get_classification_data()
model.fit(train_x, train_y)
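
The sequence labeling models (NER, PoS) from the feature list follow the same fit pattern. A minimal NER sketch; the corpus class name `ChinaPeoplesDailyNerCorpus` and the `kashgari.tasks.seq_labeling` module path are assumptions about this version's layout, so check `kashgari.corpus` and `kashgari.tasks` if the imports fail:

from kashgari.corpus import ChinaPeoplesDailyNerCorpus  # class name is an assumption
from kashgari.tasks.seq_labeling import BLSTMCRFModel   # module path is an assumption

# tokenized sentences paired with per-token tag sequences
train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data()

model = BLSTMCRFModel()
model.fit(train_x, train_y, epochs=5)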

Support for Training on Multiple GPUs

from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.classification import CNNLSTMModel

train_x, train_y = prepare_your_classification_data()

# build model with embedding
bert_embedding = BERTEmbedding('bert-large-cased', sequence_length=128)
model = CNNLSTMModel(bert_embedding)

# or without pre-trained embedding
model = CNNLSTMModel()

# Build model with your corpus
model.build_model(train_x, train_y)

# Add multi gpu support
model.build_multi_gpu_model(gpus=8)

# Train, 256 / 8 = 32 samples for every GPU per batch
model.fit(train_x, train_y, batch_size=256)
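
This appears to follow Keras's data-parallel `multi_gpu_model` pattern: each batch is split evenly across the GPU replicas, so the per-device batch is batch_size / gpus (256 / 8 = 32 above), and batch_size should be chosen divisible by gpus.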

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with the contributor guidelines and then check these open issues for specific tasks.

Reference

This library is inspired by and references the following frameworks and papers.

License

This project is licensed under the Apache-2.0 License.