nlp-bazel-tutorial

nlp-bazel-tutorial includes a Chromium-derived base library, plus named entity recognition and text classification built on top of it. Named entity recognition covers span NER and MRC NER; text classification covers BERT text classification. Each task comes with Python training code and a C++ deployment.


nlp-tutorial

nlp-tutorial is a tutorial for anyone studying NLP (Natural Language Processing). Models are trained with PyTorch and deployed with C++.

一、 Structures

  • base
  • third party
  • named entity recognition
  • text classification
  • data
 ├── base
 ├── build
 ├── data
 │   ├── name_entity_recognition
 │   │   ├── mrc-ner
 │   │   └── span-ner
 │   └── text_classification
 ├── name_entity_recognition
 │   ├── chatgpt2-ner
 │   ├── mrc-ner
 │   │   ├── mrc-for-flat-nested-ner
 │   │   └── onnx-cpp
 │   │       └── model
 │   └── span-ner
 │       ├── onnx-cpp
 │       │   └── model
 │       └── span-bert-ner-pytorch
 ├── testing
 ├── text_classification
 │   ├── bert-finetune
 │   └── onnx-cpp
 │       └── model
 └── third_party

base

Base is pulled into many projects (for example, various ChromeOS daemons), so the bar for adding something here is demonstrated wide applicability. Prefer to add things closer to where they are used (i.e. "not base"), and pull them into base only when needed. In a project our size, sometimes even duplication is OK and inevitable.

二、 Named Entity Recognition

Named entity recognition includes span NER and MRC NER.

1、Span NER follows the paper SpanNER: Named Entity Re-/Recognition as Span Prediction. The code is based on [https://github.com/lonePatient/BERT-NER-Pytorch]; on top of that code, I added export to ONNX Runtime and a C++ deployment.
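As a sketch of the span-prediction idea (the function name, label scheme, and toy inputs below are my own illustration, not code from the repository): the model predicts, for every token, whether it starts or ends an entity of some type, and decoding pairs each predicted start with the nearest following end of the same type.

```python
def decode_spans(start_labels, end_labels):
    """Pair each predicted start with the nearest following end of the same type.

    start_labels / end_labels: per-token label ids, where 0 means
    "not a boundary". Returns (label, start_index, end_index) tuples,
    end index inclusive.
    """
    spans = []
    for i, s_label in enumerate(start_labels):
        if s_label == 0:
            continue
        # Scan forward for the first end token carrying the same label.
        for j in range(i, len(end_labels)):
            if end_labels[j] == s_label:
                spans.append((s_label, i, j))
                break
    return spans

# Toy example with label 1 = PER, 2 = LOC:
# token 0 starts a PER span ending at token 1; token 4 is a one-token LOC span.
print(decode_spans([1, 0, 0, 0, 2, 0], [0, 1, 0, 0, 2, 0]))
# -> [(1, 0, 1), (2, 4, 4)]
```

The "nearest following end" rule is one common heuristic; the repository's decoding may differ in details such as span-length limits or score thresholds.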

CLUENER

The overall performance of BERT on dev:

Model                         Accuracy (entity)  Recall (entity)  F1 score (entity)
BERT+Softmax                  0.7897             0.8031           0.7963
BERT+CRF                      0.7977             0.8177           0.8076
BERT+Span                     0.8132             0.8092           0.8112
BERT+Span+adv                 0.8267             0.8073           0.8169
BERT-small(6 layers)+Span+kd  0.8241             0.7839           0.8051
BERT+Span+focal_loss          0.8121             0.8008           0.8064
BERT+Span+label_smoothing     0.8235             0.7946           0.8088

2、MRC NER comes from Shannon.AI; for details, see the ACL 2020 paper A Unified MRC Framework for Named Entity Recognition. The code is at [https://github.com/ShannonAI/mrc-for-flat-nested-ner]; on top of that code, I added export to ONNX Runtime and a C++ deployment.
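The MRC framing turns NER into question answering: each entity type gets a natural-language query, and the model extracts answer spans from the (query, context) pair. A minimal sketch of that example construction (the query texts and helper name are invented for illustration; the repository ships its own query files):

```python
# Hypothetical query templates; the real repo defines its own per-dataset queries.
QUERIES = {
    "PER": "Find person names in the text, such as names of people.",
    "LOC": "Find location names in the text, such as cities and countries.",
}

def build_mrc_examples(context, queries=QUERIES):
    """Produce one (query, context) pair per entity type.

    Each pair is fed to a BERT-style reader that predicts answer spans,
    which is how the MRC framework handles both flat and nested entities:
    overlapping spans can be extracted under different queries.
    """
    return [(query, context) for label, query in sorted(queries.items())]

pairs = build_mrc_examples("Barack Obama visited Paris.")
for query, context in pairs:
    print(query, "|", context)
```

One sentence therefore becomes as many model inputs as there are entity types, which is the main inference-time cost of this approach.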

msra_zh

Model     Precision  Recall  F1 score
BERT+MRC  0.9243     0.9113  0.9177

三、 Text Classification

Using pre-trained models for text classification.

THUCNews

Model       Accuracy  Remarks
bert        94.83%    plain BERT
ERNIE       94.61%    so much for "crushing BERT on Chinese"
bert_CNN    94.44%    BERT + CNN
bert_RNN    94.57%    BERT + RNN
bert_RCNN   94.51%    BERT + RCNN
bert_DPCNN  94.47%    BERT + DPCNN
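All of the fine-tuned variants above end the same way: a classification head produces one logit per class, and the predicted label is the argmax. A minimal sketch of that final decoding step (the label names and logit values are invented for illustration, not THUCNews outputs):

```python
LABELS = ["finance", "sports", "tech"]  # hypothetical subset of THUCNews classes

def classify(logits, labels=LABELS):
    """Return the label with the highest logit.

    Taking the argmax of raw logits gives the same answer as taking the
    argmax after softmax, because softmax is monotonic.
    """
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]

print(classify([0.2, 3.1, -0.5]))  # -> "sports"
```

The same decoding applies unchanged on the C++ side after ONNX Runtime inference, which is why only the backbone changes between the rows of the table.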