Chinese Word Segmentation (中文分词) Using

- Maximum Matching
- CRF (CRF++)
- Bi-LSTM (+ CRF)
- BERT + Bi-LSTM

Data: SIGHAN Second International Chinese Word Segmentation Bakeoff. All results below come from models trained and tested on the `pku` dataset.

Maximum Matching (FMM: forward, RMM: reverse, BIMM: bidirectional):

Method | Precision | Recall | F1-Score |
---|---|---|---|
FMM | 0.802 | 0.781 | 0.791 |
RMM | 0.805 | 0.784 | 0.794 |
BIMM | 0.806 | 0.785 | 0.795 |

CRF (CRF++):

Template | Precision | Recall | F1-Score |
---|---|---|---|
crf_template | 0.938 | 0.923 | 0.931 |

Bi-LSTM (+ CRF):

Structure | Precision | Recall | F1-Score |
---|---|---|---|
emb256_hid256_l3 | 0.9252 | 0.9237 | 0.9243 |

Structure | Precision | Recall | F1-Score |
---|---|---|---|
emb256_hid256_l3 | 0.9343 | 0.9336 | 0.9339 |

BERT + Bi-LSTM:

Structure | Precision | Recall | F1-Score |
---|---|---|---|
emb768_hid512_l2 | 0.9698 | 0.9650 | 0.9646 |
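
Precision, recall, and F1 for word segmentation are usually computed over words: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below illustrates that idea. It is not the bakeoff's scoring script, and the helper names `to_spans` and `prf` are made up for this example.

```python
def to_spans(words):
    """Turn a list of words into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_sentences, pred_sentences):
    """Word-level precision, recall and F1 over parallel lists of segmented sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)  # words with both boundaries right
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["研究", "生命", "的", "起源"]]
pred = [["研究生", "命", "的", "起源"]]
print(prf(gold, pred))  # two of the four predicted words match the gold spans
```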

To run Maximum Matching:

- Select a method in `dict_based.py` and run it
- Run `mm_score.bat` for scoring
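
For readers unfamiliar with the three dictionary-based variants, here is a minimal sketch of forward, reverse, and bidirectional maximum matching. It is an illustration only and does not reproduce `dict_based.py`: the function names, the `max_len` parameter, and the tie-breaking rule (fewer words, then fewer single-character words, then the reverse result) are assumptions.

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, vocab, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_max_match(text, vocab, max_len=4):
    """Run both directions and keep the result that looks more plausible."""
    fwd = forward_max_match(text, vocab, max_len)
    bwd = reverse_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(ws):
        return sum(len(w) == 1 for w in ws)

    # Same word count: prefer fewer single-character words, else the reverse result.
    return fwd if singles(fwd) < singles(bwd) else bwd


vocab = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命的起源", vocab, max_len=3))
print(bidirectional_max_match("研究生命的起源", vocab, max_len=3))
```

On this toy input the forward pass greedily takes `研究生` and leaves `命` stranded, while the reverse pass recovers `研究 / 生命`, which is why the bidirectional variant can score slightly higher.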

To run CRF:

- Install CRF++
- Use `crf.py` for preprocessing and postprocessing
- Use the `crf_*.bat` scripts for training, testing and scoring
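
CRF-based segmentation is typically framed as per-character tagging, so preprocessing converts segmented text into one character and tag per line, and postprocessing turns predicted tags back into words. The sketch below assumes the common B/M/E/S tag set and tab-separated columns; the actual tag set and layout used by `crf.py` may differ, and all function names here are hypothetical.

```python
def words_to_tags(words):
    """Map a segmented sentence to (character, tag) pairs using B/M/E/S tags."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))  # single-character word
        else:
            pairs.append((word[0], "B"))  # beginning of a word
            pairs.extend((ch, "M") for ch in word[1:-1])  # middle characters
            pairs.append((word[-1], "E"))  # end of a word
    return pairs


def to_crfpp_lines(words):
    """Render one sentence as one 'char<TAB>tag' line per character, then a blank line."""
    body = "\n".join(f"{ch}\t{tag}" for ch, tag in words_to_tags(words))
    return body + "\n\n"


def tags_to_words(chars, tags):
    """Rebuild words from characters and predicted tags (postprocessing)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


sentence = ["研究", "生命", "的", "起源"]
print(to_crfpp_lines(sentence), end="")
chars, tags = zip(*words_to_tags(sentence))
print(tags_to_words(chars, tags))
```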

To run Bi-LSTM (+ CRF):

- Edit model configs in `models/config.py`
- Run `python -u main.py` for training and evaluation, or `python -u test.py` for evaluation only

To run BERT + Bi-LSTM:

- Download pretrained models following the instructions in `pretrained_model.md`
- Edit the model config in `config.json`
- Run `python -u train.py` for training, or `python -u eval.py` for evaluation

References:

- https://github.com/hiyoung123/ChineseSegmentation
- https://github.com/luopeixiang/named_entity_recognition
- https://github.com/AOZMH/BERT-LSTM-Chinese-Word-Segmentation