
NERTasks


What's It?

A simple NER framework.

It implements:

- Models
  - BiLSTM-Linear (based on Long Short-Term Memory)
  - BiLSTM-Linear-CRF (based on Neural Architectures for Named Entity Recognition)
  - BERT-Linear (based on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
  - BERT-Linear-CRF
  - BERT-BiLSTM-Linear
  - BERT-BiLSTM-Linear-CRF
  - BERT(Prompt) (the EntLM approach from Template-free Prompt Tuning for Few-shot NER)
- Datasets
  - CoNLL2003 (from yuanxiaosc/BERT-for-Sequence-Labeling-and-Text-Classification)
  - OntoNotes5 (LDC2013T19)
  - CCKS2019 Subtask 1 (from TIANCHI; NER on Chinese medical documents)
  - NCBI-disease (from BioBERT, obtained via its download.sh; NER on English medical documents)
- Training Tricks
  - Gradient Accumulation
  - Learning Rate Warmup
- Misc
  - Tokenizer built from the datasets (see myutils.py)
  - NER metrics via seqeval (a Python framework for sequence labeling evaluation)

You can easily add your own models and datasets into this framework.
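All experiments below are scored with span-based micro F1 as computed by seqeval: a predicted entity only counts if both its type and its exact boundaries match a gold entity. As an illustration, here is a minimal pure-Python sketch of that metric (the framework itself calls seqeval; this simplified version does no IOB2 repair):

```python
def extract_spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
        # "I-..." tags continue the current span
    return spans

def span_micro_f1(y_true, y_pred):
    """Micro F1 over exact-match entity spans, pooled across sentences."""
    true = {(i, s) for i, seq in enumerate(y_true) for s in extract_spans(seq)}
    pred = {(i, s) for i, seq in enumerate(y_pred) for s in extract_spans(seq)}
    if not true or not pred:
        return 0.0
    tp = len(true & pred)
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]
# One of two spans matches exactly -> precision 0.5, recall 0.5
print(span_micro_f1(y_true, y_pred))  # 0.5
```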

Requirements:

Linux (tested) or Windows (not tested), with Nvidia GPUs.

Install Dependencies

We recommend using conda to create a Python environment (python==3.9). For example:

conda create -n NER python=3.9

Then run the install script. On Windows, rename it with a .bat extension first.

./install_dependencies.sh

How To Prepare Datasets

For copyright and licensing reasons, I can't distribute the datasets directly. You need to obtain access to them yourself and place them, in the specified format, into the 'assert/raw_datasets' folder; see here.

Experiments

Hyper Parameters

| Optimizer | Weight Decay | Warmup Ratio | Batch Size | Gradient Accumulation | Clip Grad Norm | Random Seed |
| --------- | ------------ | ------------ | ---------- | --------------------- | -------------- | ----------- |
| AdamW     | 5e-3         | 0.2          | 1          | 32                    | 1.0            | 233         |
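With batch size 1 and 32 gradient-accumulation steps, the effective batch size is 1 × 32 = 32. As an illustration of the warmup ratio of 0.2, here is one common linear warmup-then-linear-decay schedule in plain Python (a sketch only; the repo's actual scheduler may differ):

```python
def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.2):
    """Linear warmup from 0 to base_lr over the first warmup_ratio of
    training, then linear decay back to 0 by the final step."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

print(lr_at_step(10, 100, 1e-4))  # halfway through warmup -> 5e-05
print(lr_at_step(20, 100, 1e-4))  # warmup complete -> 0.0001
```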

Training Epochs:

| Dataset              | Full Data | Few Shot |
| -------------------- | --------- | -------- |
| CoNLL2003            | 12        | 30       |
| OntoNotes5 (Chinese) | 12        | 30       |
| CCKS2019             | 12        | 30       |
| NCBI-disease         | 20        | 30       |

Learning Rates:

| Model                  | CoNLL2003 | OntoNotes5 | CCKS2019 | NCBI-disease |
| ---------------------- | --------- | ---------- | -------- | ------------ |
| BiLSTM-Linear          | 0.001     | 0.001      | NA       | NA           |
| BiLSTM-Linear-CRF      | 0.001     | 0.001      | NA       | NA           |
| BERT-Linear            | 0.0001    | 0.0001     | 0.0001   | 0.0001       |
| BERT-Linear-CRF        | 0.0001    | 0.0001     | 0.0001   | 0.0001       |
| BERT-BiLSTM-Linear     | 0.0001    | 3e-5       | NA       | NA           |
| BERT-BiLSTM-Linear-CRF | 0.0001    | 1e-5       | NA       | NA           |
| BERT(Prompt)           | 3e-5      | 3e-5       | 0.0001   | 0.0001       |

Model Parameters

| BERT Model                                  | Embedding Size (for models without BERT) | LSTM Hidden Size | LSTM Layers |
| ------------------------------------------- | ---------------------------------------- | ---------------- | ----------- |
| bert-base-uncased (CoNLL2003, NCBI-disease) | 256                                      | 256              | 2           |
| bert-base-chinese (OntoNotes5, CCKS2019)    | 256                                      | 256              | 2           |
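For reference, a minimal PyTorch sketch of a BiLSTM-Linear tagger with the sizes from the table above (an illustration with hypothetical names, not the repo's actual implementation):

```python
import torch
import torch.nn as nn

class BiLSTMLinear(nn.Module):
    """Token classifier: embedding -> bidirectional LSTM -> linear head."""
    def __init__(self, vocab_size, num_tags,
                 embed_size=256, hidden_size=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # A bidirectional LSTM doubles the feature dimension.
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):
        x = self.embedding(token_ids)  # (batch, seq, embed)
        x, _ = self.lstm(x)            # (batch, seq, 2 * hidden)
        return self.classifier(x)      # (batch, seq, num_tags)

model = BiLSTMLinear(vocab_size=1000, num_tags=9)
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 9])
```

The CRF variants would replace the per-token argmax over these logits with a CRF layer that scores whole tag sequences.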

Results

Full Data Results

General datasets (CoNLL2003, OntoNotes5 (Chinese)).

| Dataset              | Model                  | Overall Span-Based Micro F1 | Average Training Time Per Epoch (on a Quadro RTX 8000) |
| -------------------- | ---------------------- | --------------------------- | ------------------------------------------------------ |
| CoNLL2003            | BiLSTM-Linear          | 0.6517005491858561          | 13.98s                                                  |
| CoNLL2003            | BiLSTM-Linear-CRF      | 0.6949365863103882          | 44.07s                                                  |
| CoNLL2003            | BERT-Linear            | 0.8983771483322356          | 81.81s                                                  |
| CoNLL2003            | BERT-Linear-CRF        | 0.8977943835121128          | 120.94s                                                 |
| CoNLL2003            | BERT-BiLSTM-Linear     | 0.8819152766110644          | 117.37s                                                 |
| CoNLL2003            | BERT-BiLSTM-Linear-CRF | 0.8873846891098599          | 130.85s                                                 |
| CoNLL2003            | BERT(Prompt)           | 0.9230769230769231          | 99.70s                                                  |
| OntoNotes5 (Chinese) | BiLSTM-Linear          | 0.637999350438454           | 160.55s                                                 |
| OntoNotes5 (Chinese) | BiLSTM-Linear-CRF      | 0.7033358449208851          | 319.87s                                                 |
| OntoNotes5 (Chinese) | BERT-Linear            | 0.7403041825095057          | 413.20s                                                 |
| OntoNotes5 (Chinese) | BERT-Linear-CRF        | 0.7535838822161953          | 595.71s                                                 |
| OntoNotes5 (Chinese) | BERT-BiLSTM-Linear     | 0.7511438739196745          | 590.53s                                                 |
| OntoNotes5 (Chinese) | BERT-BiLSTM-Linear-CRF | 0.7616389699353039          | 800.23s                                                 |
| OntoNotes5 (Chinese) | BERT(Prompt)           | 0.7376454875023851          | 485.56s                                                 |

Medical datasets, evaluated with both a general-domain BERT and a medical-domain BERT.

| Dataset            | BERT                    | Model           | Overall Span-Based Micro F1 |
| ------------------ | ----------------------- | --------------- | --------------------------- |
| CCKS2019 Subtask 1 | bert-base-chinese       | BERT-Linear     | 0.8057400574005741          |
| CCKS2019 Subtask 1 | bert-base-chinese       | BERT-Linear-CRF | 0.8119778310861113          |
| CCKS2019 Subtask 1 | bert-base-chinese       | BERT-Prompt     | 0.7684884784959654          |
| CCKS2019 Subtask 1 | medbert-base-chinese    | BERT-Linear     | 0.8201214508452324          |
| CCKS2019 Subtask 1 | medbert-base-chinese    | BERT-Linear-CRF | 0.8221622063998691          |
| CCKS2019 Subtask 1 | medbert-base-chinese    | BERT-Prompt     | 0.7933091394485463          |
| NCBI-disease       | bert-base-uncased       | BERT-Linear     | 0.8720903433970961          |
| NCBI-disease       | bert-base-uncased       | BERT-Linear-CRF | 0.8778661675245673          |
| NCBI-disease       | bert-base-uncased       | BERT-Prompt     | 0.8319672131147542          |
| NCBI-disease       | biobert-base-cased-v1.2 | BERT-Linear     | 0.8730125079499682          |
| NCBI-disease       | biobert-base-cased-v1.2 | BERT-Linear-CRF | 0.8774603174603175          |
| NCBI-disease       | biobert-base-cased-v1.2 | BERT-Prompt     | 0.8549382716049383          |

Few Shot Results

We sample 1% of the training set with a fixed random seed. The few-shot experiments use the same hyperparameters as the full-data experiments.
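A sketch of how such a fixed-seed 1% sample could be drawn (a hypothetical helper, not the repo's actual sampling code; the seed value 233 is taken from the hyperparameter table as an assumption):

```python
import random

def sample_few_shot(dataset, ratio=0.01, seed=233):
    """Draw a reproducible ratio-sized subset of a dataset.

    Using a dedicated random.Random(seed) instance (instead of the global
    RNG) keeps the subset identical across runs and machines.
    """
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * ratio))
    return rng.sample(dataset, k)

# 1% of a 6,973-example trainset yields 69 samples, matching CoNLL2003 below.
subset = sample_few_shot(list(range(6973)), ratio=0.01, seed=233)
print(len(subset))  # 69
```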

Few Shot Test on CoNLL2003:

Sampled 69 examples (out of 6,973 total). Entity counts in the few-shot dataset:

{'MISC': 51, 'ORG': 51, 'PER': 59, 'LOC': 90}
Overall span-based F1 on the full test set (all models use bert-base-uncased):

| Model                  | Overall Span-Based F1 |
| ---------------------- | --------------------- |
| BERT-Linear            | 0.6778304852260387    |
| BERT-Linear-CRF        | 0.6773130256876562    |
| BERT-Prompt            | 0.7524185216492908    |
| BERT-BiLSTM-Linear     | 0.037065541975802724  |
| BERT-BiLSTM-Linear-CRF | 0.029508301201363885  |

Few Shot Test on CCKS2019:

Sampled 10 examples (out of 1,000 total). Entity counts in the few-shot dataset:

{'手术': 9, '影像检查': 5, '疾病和诊断': 45, '解剖部位': 48, '实验室检验': 19, '药物': 10}

(手术 = surgery, 影像检查 = imaging examination, 疾病和诊断 = disease and diagnosis, 解剖部位 = anatomical site, 实验室检验 = laboratory test, 药物 = medication)
Overall span-based F1 on the full test set:

| Model           | bert-base-chinese    | medbert-base-chinese |
| --------------- | -------------------- | -------------------- |
| BERT-Linear     | 0.43918064570513354  | 0.47296831955922863  |
| BERT-Linear-CRF | 0.47901807928346324  | 0.537369759619329    |
| BERT-Prompt     | 0.0038852361028093247 | 0.43338090840399623  |

Few Shot Test on NCBI-disease:

Sampled 54 examples (out of about 5,400 total), containing 41 disease entities.

Overall span-based F1 on the full test set:

| Model           | bert-base-uncased  | biobert-base-cased-v1.2 |
| --------------- | ------------------ | ----------------------- |
| BERT-Linear     | 0.6449976947902258 | 0.6617314414970182      |
| BERT-Linear-CRF | 0.6520612485276797 | 0.6777615976700645      |
| BERT-Prompt     | 0.5552599758162031 | 0.5978593272171254      |

Acknowledgement And Citations

People And Organizations

  • BJTU-NLP

Third-Party Libraries

  • pytorch
  • transformers
  • datasets
  • seqeval
  • ujson
  • tqdm
  • matplotlib