
NERTasks


What's It?

A simple NER framework.

It implements:

- Models
  - BiLSTM-Linear (based on Long Short-Term Memory)
  - BiLSTM-Linear-CRF (based on Neural Architectures for Named Entity Recognition)
  - BERT-Linear (based on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
  - BERT-Linear-CRF
  - BERT-BiLSTM-Linear
  - BERT-BiLSTM-Linear-CRF
  - BERT(Prompt) (the EntLM approach from Template-free Prompt Tuning for Few-shot NER)
- Datasets
  - CoNLL2003 (from yuanxiaosc/BERT-for-Sequence-Labeling-and-Text-Classification)
  - OntoNotes5 (LDC2013T19)
  - CCKS2019 Subtask 1 (from TIANCHI; NER on Chinese medical documents)
  - NCBI-disease (from BioBERT, obtained via its download.sh; NER on English medical documents)
- Training Tricks
  - Gradient Accumulation
  - Learning Rate Warmup
- Misc
  - Tokenizer built from the datasets (see myutils.py)
  - NER metrics via seqeval (a Python framework for sequence labeling evaluation)

You can easily add your own models and datasets into this framework.
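All experiments below are scored with span-based micro F1 as computed by seqeval: a predicted entity only counts if both its type and its exact boundaries match a gold entity. As an illustration, here is a minimal pure-Python sketch of that metric (the framework itself calls seqeval; this simplified version does no IOB2 repair):

```python
def extract_spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
        # "I-..." tags continue the current span
    return spans

def span_micro_f1(y_true, y_pred):
    """Micro F1 over exact-match entity spans, pooled across sentences."""
    true = {(i, s) for i, seq in enumerate(y_true) for s in extract_spans(seq)}
    pred = {(i, s) for i, seq in enumerate(y_pred) for s in extract_spans(seq)}
    if not true or not pred:
        return 0.0
    tp = len(true & pred)
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]
# One of two spans matches exactly -> precision 0.5, recall 0.5
print(span_micro_f1(y_true, y_pred))  # 0.5
```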

Requirements:

Linux (tested) or Windows (not tested), with Nvidia GPUs.

Install Dependencies

We recommend using conda to create a Python environment (python==3.9). For example:

conda create -n NER python=3.9

Then run the install script. On Windows, rename it with a .bat extension first.

./install_dependencies.sh

How To Prepare Datasets

For copyright and licensing reasons, I can't distribute the datasets directly. You need to obtain access to them yourself and place them, in the specified format, into the 'assert/raw_datasets' folder; see here.

Experiments

Hyper Parameters

| Optimizer | Weight Decay | Warmup Ratio | Batch Size | Gradient Accumulation | Clip Grad Norm | Random Seed |
| --------- | ------------ | ------------ | ---------- | --------------------- | -------------- | ----------- |
| AdamW     | 5e-3         | 0.2          | 1          | 32                    | 1.0            | 233         |
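With batch size 1 and 32 gradient-accumulation steps, the effective batch size is 1 × 32 = 32. As an illustration of the warmup ratio of 0.2, here is one common linear warmup-then-linear-decay schedule in plain Python (a sketch only; the repo's actual scheduler may differ):

```python
def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.2):
    """Linear warmup from 0 to base_lr over the first warmup_ratio of
    training, then linear decay back to 0 by the final step."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

print(lr_at_step(10, 100, 1e-4))  # halfway through warmup -> 5e-05
print(lr_at_step(20, 100, 1e-4))  # warmup complete -> 0.0001
```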

Training Epochs:

| Dataset              | Full Data | Few Shot |
| -------------------- | --------- | -------- |
| CoNLL2003            | 12        | 30       |
| OntoNotes5 (Chinese) | 12        | 30       |
| CCKS2019             | 12        | 30       |
| NCBI-disease         | 20        | 30       |

Learning Rates:

| Model                  | CoNLL2003 | OntoNotes5 | CCKS2019 | NCBI-disease |
| ---------------------- | --------- | ---------- | -------- | ------------ |
| BiLSTM-Linear          | 0.001     | 0.001      | NA       | NA           |
| BiLSTM-Linear-CRF      | 0.001     | 0.001      | NA       | NA           |
| BERT-Linear            | 0.0001    | 0.0001     | 0.0001   | 0.0001       |
| BERT-Linear-CRF        | 0.0001    | 0.0001     | 0.0001   | 0.0001       |
| BERT-BiLSTM-Linear     | 0.0001    | 3e-5       | NA       | NA           |
| BERT-BiLSTM-Linear-CRF | 0.0001    | 1e-5       | NA       | NA           |
| BERT(Prompt)           | 3e-5      | 3e-5       | 0.0001   | 0.0001       |

Model Parameters

| BERT Model                                  | Embedding Size (for models without BERT) | LSTM Hidden Size | LSTM Layers |
| ------------------------------------------- | ---------------------------------------- | ---------------- | ----------- |
| bert-base-uncased (CoNLL2003, NCBI-disease) | 256                                      | 256              | 2           |
| bert-base-chinese (OntoNotes5, CCKS2019)    | 256                                      | 256              | 2           |
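For reference, a minimal PyTorch sketch of a BiLSTM-Linear tagger with the sizes from the table above (an illustration with hypothetical names, not the repo's actual implementation):

```python
import torch
import torch.nn as nn

class BiLSTMLinear(nn.Module):
    """Token classifier: embedding -> bidirectional LSTM -> linear head."""
    def __init__(self, vocab_size, num_tags,
                 embed_size=256, hidden_size=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # A bidirectional LSTM doubles the feature dimension.
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):
        x = self.embedding(token_ids)  # (batch, seq, embed)
        x, _ = self.lstm(x)            # (batch, seq, 2 * hidden)
        return self.classifier(x)      # (batch, seq, num_tags)

model = BiLSTMLinear(vocab_size=1000, num_tags=9)
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 9])
```

The CRF variants would replace the per-token argmax over these logits with a CRF layer that scores whole tag sequences.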

Results

Full Data Results

General datasets (CoNLL2003, OntoNotes5 (Chinese)).

| Dataset              | Model                  | Overall Span-Based Micro F1 | Average Training Time Per Epoch (on a Quadro RTX 8000) |
| -------------------- | ---------------------- | --------------------------- | ------------------------------------------------------ |
| CoNLL2003            | BiLSTM-Linear          | 0.6517005491858561          | 13.98s                                                  |
| CoNLL2003            | BiLSTM-Linear-CRF      | 0.6949365863103882          | 44.07s                                                  |
| CoNLL2003            | BERT-Linear            | 0.8983771483322356          | 81.81s                                                  |
| CoNLL2003            | BERT-Linear-CRF        | 0.8977943835121128          | 120.94s                                                 |
| CoNLL2003            | BERT-BiLSTM-Linear     | 0.8819152766110644          | 117.37s                                                 |
| CoNLL2003            | BERT-BiLSTM-Linear-CRF | 0.8873846891098599          | 130.85s                                                 |
| CoNLL2003            | BERT(Prompt)           | 0.9230769230769231          | 99.70s                                                  |
| OntoNotes5 (Chinese) | BiLSTM-Linear          | 0.637999350438454           | 160.55s                                                 |
| OntoNotes5 (Chinese) | BiLSTM-Linear-CRF      | 0.7033358449208851          | 319.87s                                                 |
| OntoNotes5 (Chinese) | BERT-Linear            | 0.7403041825095057          | 413.20s                                                 |
| OntoNotes5 (Chinese) | BERT-Linear-CRF        | 0.7535838822161953          | 595.71s                                                 |
| OntoNotes5 (Chinese) | BERT-BiLSTM-Linear     | 0.7511438739196745          | 590.53s                                                 |
| OntoNotes5 (Chinese) | BERT-BiLSTM-Linear-CRF | 0.7616389699353039          | 800.23s                                                 |
| OntoNotes5 (Chinese) | BERT(Prompt)           | 0.7376454875023851          | 485.56s                                                 |

Medical datasets, evaluated with both a general-domain BERT and a medical-domain BERT.

| Dataset            | BERT                    | Model           | Overall Span-Based Micro F1 |
| ------------------ | ----------------------- | --------------- | --------------------------- |
| CCKS2019 Subtask 1 | bert-base-chinese       | BERT-Linear     | 0.8057400574005741          |
| CCKS2019 Subtask 1 | bert-base-chinese       | BERT-Linear-CRF | 0.8119778310861113          |
| CCKS2019 Subtask 1 | bert-base-chinese       | BERT-Prompt     | 0.7684884784959654          |
| CCKS2019 Subtask 1 | medbert-base-chinese    | BERT-Linear     | 0.8201214508452324          |
| CCKS2019 Subtask 1 | medbert-base-chinese    | BERT-Linear-CRF | 0.8221622063998691          |
| CCKS2019 Subtask 1 | medbert-base-chinese    | BERT-Prompt     | 0.7933091394485463          |
| NCBI-disease       | bert-base-uncased       | BERT-Linear     | 0.8720903433970961          |
| NCBI-disease       | bert-base-uncased       | BERT-Linear-CRF | 0.8778661675245673          |
| NCBI-disease       | bert-base-uncased       | BERT-Prompt     | 0.8319672131147542          |
| NCBI-disease       | biobert-base-cased-v1.2 | BERT-Linear     | 0.8730125079499682          |
| NCBI-disease       | biobert-base-cased-v1.2 | BERT-Linear-CRF | 0.8774603174603175          |
| NCBI-disease       | biobert-base-cased-v1.2 | BERT-Prompt     | 0.8549382716049383          |

Few Shot Results

We sample 1% of the training set with a fixed random seed. The few-shot experiments use the same hyperparameters as the full-data experiments.
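A sketch of how such a fixed-seed 1% sample could be drawn (a hypothetical helper, not the repo's actual sampling code; the seed value 233 is taken from the hyperparameter table as an assumption):

```python
import random

def sample_few_shot(dataset, ratio=0.01, seed=233):
    """Draw a reproducible ratio-sized subset of a dataset.

    Using a dedicated random.Random(seed) instance (instead of the global
    RNG) keeps the subset identical across runs and machines.
    """
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * ratio))
    return rng.sample(dataset, k)

# 1% of a 6,973-example trainset yields 69 samples, matching CoNLL2003 below.
subset = sample_few_shot(list(range(6973)), ratio=0.01, seed=233)
print(len(subset))  # 69
```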

Few Shot Test on CoNLL2003:

Sampled 69 examples (out of 6,973 total). Entity counts in the few-shot dataset:

{'MISC': 51, 'ORG': 51, 'PER': 59, 'LOC': 90}
Overall span-based F1 on the full test set (all models use bert-base-uncased):

| Model                  | Overall Span-Based F1 |
| ---------------------- | --------------------- |
| BERT-Linear            | 0.6778304852260387    |
| BERT-Linear-CRF        | 0.6773130256876562    |
| BERT-Prompt            | 0.7524185216492908    |
| BERT-BiLSTM-Linear     | 0.037065541975802724  |
| BERT-BiLSTM-Linear-CRF | 0.029508301201363885  |

Few Shot Test on CCKS2019:

Sampled 10 examples (out of 1,000 total). Entity counts in the few-shot dataset:

{'手术': 9, '影像检查': 5, '疾病和诊断': 45, '解剖部位': 48, '实验室检验': 19, '药物': 10}

(手术 = surgery, 影像检查 = imaging examination, 疾病和诊断 = disease and diagnosis, 解剖部位 = anatomical site, 实验室检验 = laboratory test, 药物 = medication)
Overall span-based F1 on the full test set:

| Model           | bert-base-chinese    | medbert-base-chinese |
| --------------- | -------------------- | -------------------- |
| BERT-Linear     | 0.43918064570513354  | 0.47296831955922863  |
| BERT-Linear-CRF | 0.47901807928346324  | 0.537369759619329    |
| BERT-Prompt     | 0.0038852361028093247 | 0.43338090840399623  |

Few Shot Test on NCBI-disease:

Sampled 54 examples (out of about 5,400 total), containing 41 disease entities.

Overall span-based F1 on the full test set:

| Model           | bert-base-uncased  | biobert-base-cased-v1.2 |
| --------------- | ------------------ | ----------------------- |
| BERT-Linear     | 0.6449976947902258 | 0.6617314414970182      |
| BERT-Linear-CRF | 0.6520612485276797 | 0.6777615976700645      |
| BERT-Prompt     | 0.5552599758162031 | 0.5978593272171254      |

Acknowledgement And Citations

People And Organizations

  • BJTU-NLP

Third-Party Libraries

  • pytorch
  • transformers
  • datasets
  • seqeval
  • ujson
  • tqdm
  • matplotlib