nlp-bazel-tutorial

nlp-bazel-tutorial includes a Chromium-derived base library, plus named entity recognition and text classification built on top of it. Named entity recognition covers span NER and MRC NER; text classification covers BERT text classification. Each task comes with Python training code and a C++ deployment.


nlp-tutorial

nlp-tutorial is a tutorial for anyone studying NLP (Natural Language Processing). Models are trained with PyTorch and deployed with C++.

一、 Structures

  • base
  • third party
  • named entity recognition
  • text classification
  • data
 ├── base
 ├── build
 ├── data
 │   ├── name_entity_recognition
 │   │   ├── mrc-ner
 │   │   └── span-ner
 │   └── text_classification
 ├── name_entity_recognition
 │   ├── chatgpt2-ner
 │   ├── mrc-ner
 │   │   ├── mrc-for-flat-nested-ner
 │   │   └── onnx-cpp
 │   │       └── model
 │   └── span-ner
 │       ├── onnx-cpp
 │       │   └── model
 │       └── span-bert-ner-pytorch
 ├── testing
 ├── text_classification
 │   ├── bert-finetune
 │   └── onnx-cpp
 │       └── model
 └── third_party

base

Base is pulled into many projects (for example, various ChromeOS daemons), so the bar for adding something here is demonstrated wide applicability. Prefer to add things closer to where they are used (i.e. "not base"), and pull them into base only when needed. In a project our size, sometimes even duplication is OK and inevitable.

二、 Named Entity Recognition

Named entity recognition includes span NER and MRC NER.

1、Span NER follows the paper SpanNER: Named Entity Re-/Recognition as Span Prediction. The code is based on [https://github.com/lonePatient/BERT-NER-Pytorch]; on top of that code, I added export to ONNX Runtime and a C++ deployment.
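As a sketch of the span-prediction idea (the function name, label scheme, and toy inputs below are my own illustration, not code from the repository): the model predicts, for every token, whether it starts or ends an entity of some type, and decoding pairs each predicted start with the nearest following end of the same type.

```python
def decode_spans(start_labels, end_labels):
    """Pair each predicted start with the nearest following end of the same type.

    start_labels / end_labels: per-token label ids, where 0 means
    "not a boundary". Returns (label, start_index, end_index) tuples,
    end index inclusive.
    """
    spans = []
    for i, s_label in enumerate(start_labels):
        if s_label == 0:
            continue
        # Scan forward for the first end token carrying the same label.
        for j in range(i, len(end_labels)):
            if end_labels[j] == s_label:
                spans.append((s_label, i, j))
                break
    return spans

# Toy example with label 1 = PER, 2 = LOC:
# token 0 starts a PER span ending at token 1; token 4 is a one-token LOC span.
print(decode_spans([1, 0, 0, 0, 2, 0], [0, 1, 0, 0, 2, 0]))
# -> [(1, 0, 1), (2, 4, 4)]
```

The "nearest following end" rule is one common heuristic; the repository's decoding may differ in details such as span-length limits or score thresholds.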

CLUENER

The overall performance of BERT on dev:

Model                         Accuracy (entity)  Recall (entity)  F1 score (entity)
BERT+Softmax                  0.7897             0.8031           0.7963
BERT+CRF                      0.7977             0.8177           0.8076
BERT+Span                     0.8132             0.8092           0.8112
BERT+Span+adv                 0.8267             0.8073           0.8169
BERT-small(6 layers)+Span+kd  0.8241             0.7839           0.8051
BERT+Span+focal_loss          0.8121             0.8008           0.8064
BERT+Span+label_smoothing     0.8235             0.7946           0.8088

2、MRC NER comes from Shannon.AI; for details, see the ACL 2020 paper A Unified MRC Framework for Named Entity Recognition. The code is at [https://github.com/ShannonAI/mrc-for-flat-nested-ner]; on top of that code, I added export to ONNX Runtime and a C++ deployment.
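The MRC framing turns NER into question answering: each entity type gets a natural-language query, and the model extracts answer spans from the (query, context) pair. A minimal sketch of that example construction (the query texts and helper name are invented for illustration; the repository ships its own query files):

```python
# Hypothetical query templates; the real repo defines its own per-dataset queries.
QUERIES = {
    "PER": "Find person names in the text, such as names of people.",
    "LOC": "Find location names in the text, such as cities and countries.",
}

def build_mrc_examples(context, queries=QUERIES):
    """Produce one (query, context) pair per entity type.

    Each pair is fed to a BERT-style reader that predicts answer spans,
    which is how the MRC framework handles both flat and nested entities:
    overlapping spans can be extracted under different queries.
    """
    return [(query, context) for label, query in sorted(queries.items())]

pairs = build_mrc_examples("Barack Obama visited Paris.")
for query, context in pairs:
    print(query, "|", context)
```

One sentence therefore becomes as many model inputs as there are entity types, which is the main inference-time cost of this approach.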

msra_zh

Model     Precision  Recall  F1 score
BERT+MRC  0.9243     0.9113  0.9177

三、 Text Classification

Using pre-trained models for text classification.

THUCNews

Model       Accuracy  Remarks
bert        94.83%    plain BERT
ERNIE       94.61%    so much for "crushing BERT on Chinese"
bert_CNN    94.44%    BERT + CNN
bert_RNN    94.57%    BERT + RNN
bert_RCNN   94.51%    BERT + RCNN
bert_DPCNN  94.47%    BERT + DPCNN
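All of the fine-tuned variants above end the same way: a classification head produces one logit per class, and the predicted label is the argmax. A minimal sketch of that final decoding step (the label names and logit values are invented for illustration, not THUCNews outputs):

```python
LABELS = ["finance", "sports", "tech"]  # hypothetical subset of THUCNews classes

def classify(logits, labels=LABELS):
    """Return the label with the highest logit.

    Taking the argmax of raw logits gives the same answer as taking the
    argmax after softmax, because softmax is monotonic.
    """
    best = max(range(len(logits)), key=lambda i: logits[i])
    return labels[best]

print(classify([0.2, 3.1, -0.5]))  # -> "sports"
```

The same decoding applies unchanged on the C++ side after ONNX Runtime inference, which is why only the backbone changes between the rows of the table.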