This project, built on PyTorch, benchmarks common NER paradigm models on Chinese NER datasets of different types (flat, nested, discontinuous). The implemented model series includes the following (a minimal sketch of the simplest paradigm is given after the list):
- Bert-Softmax, Bert-Crf, Bert-BiLSTM-Softmax, Bert-BiLSTM-Crf
- Word-feature (lexicon-enhanced) models: FlatNER, LEBERT
- PointerNET (to do)
- MRC (Machine Reading Comprehension)
- Span-based NER (to do)
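The simplest of these paradigms, Bert-Softmax, is just a linear token-classification head on top of BERT. A minimal sketch, assuming `bert-base-chinese` and an illustrative label count; the class name and details are not the repository's actual code:

```python
# Minimal sketch of the Bert-Softmax paradigm (illustrative, not the repo's code).
import torch.nn as nn
from transformers import BertModel

class BertSoftmaxNER(nn.Module):
    def __init__(self, pretrained_name="bert-base-chinese", num_labels=9):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        # -100 masks padding/special-token positions in the loss
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(hidden)  # (batch, seq_len, num_labels)
        if labels is not None:
            return self.loss_fn(logits.reshape(-1, logits.size(-1)),
                                labels.reshape(-1))
        return logits.argmax(dim=-1)  # per-token label ids at inference
```

The CRF variants replace the per-token softmax with a CRF layer that decodes the globally best tag sequence; the BiLSTM variants insert a BiLSTM between BERT and the output layer.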
The models are mainly evaluated on the Chinese NER datasets below:
- Flat NER datasets: Ontonote4, Msra
- Nested NER datasets: ACE 2004, ACE 2005
- Discontinuous NER datasets: CADEC
Standard (sequence-labeling) NER data is processed into the following format (a decoding sketch follows the example):
```json
{
  "text": ["吴", "重", "阳", ",", "中", "国", "国", "籍", ","],
  "label": ["B-NAME", "I-NAME", "I-NAME", "O", "B-CONT", "I-CONT", "I-CONT", "I-CONT", "O"]
}
```
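For orientation, a small sketch of decoding the BIO `label` sequence into entity spans; the field names match the JSON above, while the helper itself is illustrative:

```python
# Decode BIO tags into (entity_type, start, end_inclusive) spans.
def bio_to_spans(labels):
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(labels):
        if tag.startswith("B-"):
            if start is not None:  # close any open span
                spans.append((ent_type, start, i - 1))
            start, ent_type = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue  # span keeps growing
        else:  # "O" or an inconsistent tag closes the span
            if start is not None:
                spans.append((ent_type, start, i - 1))
            start, ent_type = None, None
    if start is not None:
        spans.append((ent_type, start, len(labels) - 1))
    return spans

labels = ["B-NAME", "I-NAME", "I-NAME", "O",
          "B-CONT", "I-CONT", "I-CONT", "I-CONT", "O"]
print(bio_to_spans(labels))  # [('NAME', 0, 2), ('CONT', 4, 7)]
```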
MRC-style NER (MRC-NER) data is processed into the following format (a sketch of how the position fields relate follows the example):
```json
{
  "context": "图 为 马 拉 维 首 都 利 隆 圭 政 府 办 公 大 楼 。 ( 本 报 记 者 温 宪 摄 )",
  "end_position": [4, 15],
  "entity_label": "NS",
  "impossible": false,
  "qas_id": "3820.1",
  "query": "按照地理位置划分的国家,城市,乡镇,大洲",
  "span_position": ["2;4", "7;15"],
  "start_position": [2, 7]
}
```
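Reading the fields together: `start_position[i]` and `end_position[i]` appear to be inclusive token indices into the whitespace-split `context`, and `span_position` stores the same pairs as `"start;end"` strings. A sketch of recovering the entity mentions (illustrative code, not the repository's loader):

```python
# Recover entity surface strings from the MRC-NER position fields.
sample = {
    "context": "图 为 马 拉 维 首 都 利 隆 圭 政 府 办 公 大 楼 。 ( 本 报 记 者 温 宪 摄 )",
    "start_position": [2, 7],
    "end_position": [4, 15],
    "entity_label": "NS",
}

tokens = sample["context"].split()
for start, end in zip(sample["start_position"], sample["end_position"]):
    # end index is inclusive, hence end + 1
    print(sample["entity_label"], "".join(tokens[start:end + 1]))
# NS 马拉维
# NS 利隆圭政府办公大楼
```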
Dependencies: python==3.8, transformers>=4.12.3, torch==1.8.0, or install them via:

```shell
pip install -r requirements.txt
```
- config: model parameter definitions
- datasets: data pipelines
- losses: loss functions
- metrics: evaluation metrics
- models: BERT-based model implementations
- output: output directory for trained models and training logs
- processors: data processing
- script: shell scripts
- utils: utility classes
- train.py: main entry point
You can start training a model by running the shell scripts:
- all NER models except MRC: `bash script/train.sh`
- the MRC model: `bash script/mrc_train.sh`
Best F1 scores on the test sets:
Model (test F1) | Msra | Ontonote |
---|---|---|
BERT-Softmax | 0.9553 | 0.8181 |
BERT-BiLSTM-Softmax | 0.9566 | 0.8177 |
BERT-BiLSTM-LabelSmooth | 0.9549 | 0.8215 |
BERT-Crf | 0.9562 | 0.8218 |
BERT-BiLSTM-Crf | 0.9561 | 0.8227 |
BERT-BiLSTM-Crf-LabelSmooth | 0.9547 | 0.8216 |
BERT-BiLSTM-Crf-LEBERT | 0.9518 | 0.8094 |
BERT-BiLSTM-Softmax-LEBERT | 0.9544 | 0.8196 |
MRC | 0.942 | 0.812 |
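For the `*-LabelSmooth` rows, a minimal sketch of a label-smoothing cross-entropy; torch==1.8.0 predates the built-in `label_smoothing` argument of `CrossEntropyLoss`, so a manual version like this is typical, though not necessarily the repository's exact loss:

```python
# Label-smoothing cross-entropy: mix the one-hot target with a uniform
# distribution over labels (illustrative, standard formulation).
import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, eps=0.1):
        super().__init__()
        self.eps = eps

    def forward(self, logits, target):
        # logits: (N, num_labels); target: (N,) of label ids
        log_probs = F.log_softmax(logits, dim=-1)
        nll = -log_probs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)
        smooth = -log_probs.mean(dim=-1)  # expected NLL under a uniform prior
        return ((1.0 - self.eps) * nll + self.eps * smooth).mean()
```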
GPU: NVIDIA GeForce RTX 3060 Ti (8 GB)
In terms of speed, taking the Msra dataset as an example (41,728 training samples), approximate total training times are listed below; overall, the CRF is noticeably slower.
Model | Training time | Batch size |
---|---|---|
BERT-Softmax | 6min 14s | 24 |
BERT-BiLSTM-Softmax | 6min 46s | 24 |
BERT-Crf | 8min 06s | 24 |
BERT-BiLSTM-Crf | 8min 20s | 24 |
MRC | 50min 10s | 4 |