/ clfzoo /
clfzoo is a toolkit for text classification. It implements several baseline models: TextCNN, TextRNN, RCNN, Transformer, HAN, and DPCNN. It also provides a unified, friendly API to train, predict with, and test the models. Code contributions and suggestions are welcome.
python3+
numpy
sklearn
tensorflow>=1.6.0
git clone https://github.com/SeanLee97/clfzoo.git
cd clfzoo
project
│ README.md
│
└─── docs
│
└─── clfzoo # models
│ │ base.py # base model template
│ │ config.py # default configure
│ │ dataloader.py
│ │ instance.py # data instance
│ │ vocab.py # vocabulary
│ │ libs # layers and functions
│ │ dpcnn # implement dpcnn model
│ │ │ __init__.py # model apis
│ │ │ model.py # model
│ │ ... # implement other models
└───examples
│ ...
Each line is a document in the format "label \t sentence". The default word tokenizer splits on whitespace, so the words in each sentence must already be separated by spaces.
English sample:
greeting how are you.
Chinese sample:
打招呼 你 最近 过得 怎样 啊 ?
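The data format above can be sketched in a few lines of Python. This is only an illustration: `format_line` and `parse_line` are hypothetical helpers, not part of clfzoo, and the sketch assumes the label and sentence are separated by a single tab and that the words are already segmented.

```python
import io


def format_line(label, tokens):
    """Join a label and pre-segmented words into one 'label<TAB>sentence' line."""
    return "{}\t{}".format(label, " ".join(tokens))


def parse_line(line):
    """Split one line into (label, tokens), using the first tab as the separator."""
    label, sentence = line.rstrip("\n").split("\t", 1)
    return label, sentence.split()


# Words are already segmented; Chinese text needs a word segmenter beforehand.
samples = [
    ("greeting", ["how", "are", "you", "."]),
    ("打招呼", ["你", "最近", "过得", "怎样", "啊", "?"]),
]

# Write to an in-memory buffer here; a real training file would be opened
# with open(path, "w", encoding="utf-8") instead.
buffer = io.StringIO()
for label, tokens in samples:
    buffer.write(format_line(label, tokens) + "\n")

parsed = [parse_line(line) for line in buffer.getvalue().splitlines()]
```

Writing the file with an explicit tab avoids the common mistake of separating the label from the sentence with a space, which would make the label indistinguishable from the first word.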
# import model api
import clfzoo.textcnn as clf
# import model config
from clfzoo.config import ConfigTextCNN
"""Define the model config.
You can assign values to the hyperparameters defined in the base model config (here, ConfigTextCNN).
"""
class Config(ConfigTextCNN):
    def __init__(self):
        # calling the parent constructor is required
        super(Config, self).__init__()

    # providing the dataset paths is required
    train_file = '/path/to/train'
    dev_file = '/path/to/test'
    # ... other hyperparameters
# `training` is a flag that indicates training mode.
clf.model(Config(), training=True)
# start to train
clf.train()
The training log is written to log.txt, and the model weights and checkpoint summaries are written to the models folder.
Predict the labels and probability scores.
import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNN
class Config(ConfigTextCNN):
    def __init__(self):
        super(Config, self).__init__()
        # the same hyperparameters as in training
# inject config to model
clf.model(Config())
"""
Input: a list
    Each item is a sentence string whose words are separated by spaces (Chinese sentences must be word-segmented beforehand).
"""
datas = ['how are u ?', 'what is the weather ?', ...]
"""
Return: a list
[('label 1', 'score 1'), ('label 2', 'score 2'), ...]
"""
preds = clf.predict(datas)
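Since `predict` returns one `(label, score)` pair per input sentence, the output is easy to post-process. The sketch below routes low-confidence predictions to a separate list. `split_by_confidence`, the threshold of 0.5, and the sample scores are made up for illustration and are not part of clfzoo; the scores are passed through `float()` on the assumption that they are numeric or numeric strings.

```python
def split_by_confidence(sentences, preds, threshold=0.5):
    """Pair each sentence with its prediction and split the pairs by score."""
    confident, uncertain = [], []
    for sentence, (label, score) in zip(sentences, preds):
        score = float(score)  # tolerate scores returned as strings
        bucket = confident if score >= threshold else uncertain
        bucket.append((sentence, label, score))
    return confident, uncertain


# Stand-in for the result of clf.predict(datas); the values are invented.
example_sentences = ['how are u ?', 'what is the weather ?']
example_preds = [('greeting', 0.93), ('weather', 0.41)]

confident, uncertain = split_by_confidence(
    example_sentences, example_preds, threshold=0.5
)
```

A reasonable threshold depends on the model and the dataset, so it is worth tuning on the dev set rather than fixing it in advance.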
Predict the labels and probability scores, and compute evaluation metrics. To calculate the metrics, you must provide the ground-truth labels.
import clfzoo.textcnn as clf
from clfzoo.config import ConfigTextCNN
class Config(ConfigTextCNN):
    def __init__(self):
        super(Config, self).__init__()
        # the same hyperparameters as in training
# inject config to model
clf.model(Config())
"""
Input: a list
    Each item is a sentence string whose words are separated by spaces (Chinese sentences must be word-segmented beforehand).
"""
datas = ['how are u ?', 'what is the weather ?', ...]
labels = ['greeting', 'weather', ...]
"""
Return: a tuple
    - predicts: a list
        [('label 1', 'score 1'), ('label 2', 'score 2'), ...]
    - metrics: a dict
        {'recall': '', 'precision': '', 'f1': '', 'accuracy': ''}
"""
preds, metrics = clf.test(datas, labels)
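To make the four values in the metrics dict concrete, here is a pure-Python sketch that computes them from gold and predicted labels. It is not clfzoo's implementation: `macro_metrics` is a hypothetical helper, and macro averaging over labels is an assumption, since the averaging mode used by `clf.test` is not stated here. With real output, the predicted labels could be taken as `[label for label, _ in preds]`.

```python
def macro_metrics(gold, predicted):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        'precision': sum(precisions) / n,
        'recall': sum(recalls) / n,
        'f1': sum(f1s) / n,
        'accuracy': sum(1 for g, p in pairs if g == p) / len(pairs),
    }


# Invented labels for illustration only.
gold = ['greeting', 'weather', 'weather', 'greeting']
predicted = ['greeting', 'weather', 'greeting', 'greeting']
scores = macro_metrics(gold, predicted)
```

Macro averaging weights every label equally, which matters on datasets where some labels are much rarer than others.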
Here we use the smp2017-ECDT dataset as an example. It is a multi-class (31 labels), short-text, Chinese dataset.
We train every model for 20 epochs and calculate the metrics with the sklearn metrics functions. Because fasttext is widely regarded as a strong baseline for text classification, its results are included for comparison.
Models | Precision | Recall | F1 |
---|---|---|---|
fasttext | 0.81 | 0.81 | 0.81 |
TextCNN | 0.83 | 0.84 | 0.83 |
TextRNN | 0.84 | 0.83 | 0.82 |
RCNN | 0.86 | 0.85 | 0.85 |
DPCNN | 0.87 | 0.85 | 0.85 |
Transformer | 0.74 | 0.67 | 0.68 |
HAN | TODO | TODO | TODO |
Attention! Transformer and HAN do not perform well yet. We will fix the bugs and update their results later.
- sean lee
- a single coder
- seanlee97@github.io
- x.m. li
- an undergraduate student from Shanxi University
- holahack@github
- ...
Some code modules from
Papers
- TextCNN: Convolutional Neural Networks for Sentence Classification
- DPCNN: Deep Pyramid Convolutional Neural Networks for Text Categorization
- Transformer: Attention Is All You Need
- HAN: Hierarchical Attention Networks for Document Classification
If you have any questions, please email xmlee97#gmail.com