AL-NER
In this project, we use pre-trained word2vec embeddings for the word-embedding layer, a BiLSTM as the encoder, and a CRF as the decoder. To evaluate the active learning strategies, we also implement several uncertainty-based sample selection strategies.
Reference
Please cite the paper if this project/paper contributes to your research.
```
@misc{liu2020ltp,
title={LTP: A New Active Learning Strategy for CRF-Based Named Entity Recognition},
author={Mingyi Liu and Zhiying Tu and Tong Zhang and Tonghua Su and Zhongjie Wang},
year={2020},
eprint={2001.02524},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
You can find the latest version of this paper at: https://www.researchgate.net/publication/338476927_LTP_A_New_Active_Learning_Strategy_for_CRF-Based_Named_Entity_Recognition
Note: the arXiv submission is still on hold.
Word Embedding
In this project, we use a 300d word2vec embedding pre-trained on the Chinese Wikipedia corpus for the Chinese datasets, and a 100d GloVe embedding pre-trained on the English Wikipedia corpus for the English datasets. You can get them from the download link below; you then need to convert these files to .pkl files.
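As a rough sketch of this conversion, assuming the embeddings come in the standard text format (one token followed by its vector per line) and that the pipeline expects a pickled {word: vector} dict, which may not match the exact .pkl layout this project uses:

```python
import pickle

import numpy as np

def txt_embedding_to_pkl(txt_path, pkl_path, skip_header=False):
    """Convert a text-format embedding file ('word v1 v2 ...' per line)
    into a {word: np.ndarray} dict stored as a .pkl file (sketch only)."""
    embeddings = {}
    with open(txt_path, encoding="utf-8") as f:
        if skip_header:  # word2vec .txt files start with a "<vocab> <dim>" line
            next(f)
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    with open(pkl_path, "wb") as f:
        pickle.dump(embeddings, f)

# For example:
# txt_embedding_to_pkl("merge_sgns_bigram_char300.txt",
#                      "../embedding/merge_sgns_bigram_char300.pkl",
#                      skip_header=True)
```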
BiLSTM-CRF
BiLSTM-CRF has been widely used for named entity recognition on several typical datasets.
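For reference, here is a minimal PyTorch sketch of this encoder/decoder pairing, using the third-party pytorch-crf package for the CRF layer; the implementation in this repository may differ in detail.

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Minimal BiLSTM encoder + CRF decoder (hypothetical, for illustration)."""

    def __init__(self, vocab_size, num_tags, embedding_dim=300, hidden_dim=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # BiLSTM encoder; hidden_dim is split across the two directions
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2, num_layers=1,
                            bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)
        # CRF decoder over the per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, tokens):
        encoded, _ = self.lstm(self.embedding(tokens))
        return self.hidden2tag(encoded)

    def loss(self, tokens, tags, mask):
        # Negative log-likelihood of the gold tag sequences under the CRF
        return -self.crf(self.emissions(tokens), tags, mask=mask)

    def decode(self, tokens, mask):
        # Viterbi-best tag sequence for each sentence
        return self.crf.decode(self.emissions(tokens), mask=mask)
```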
Sample selection strategies
In this project, the following selection strategies are implemented (a sketch of the uncertainty-scoring pattern they share follows the list).
- RANDOM: RandomStrategy
- LC: LeastConfidenceStrategy
- NLC: NormalizedLeastConfidenceStrategy
- LTP: LeastTokenProbabilityStrategy
- MTP: MinimumTokenProbabilityStrategy
- MTE: MaximumTokenEntropyStrategy
- LONG: LongStrategy
- TE: TokenEntropyStrategy
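As an illustration of that shared pattern, here is a sketch of LC (least confidence), which scores a sentence by 1 - P(y*|x), the probability of its Viterbi-best tag path, and queries the k most uncertain sentences. It reuses the hypothetical BiLSTMCRF sketch above; the strategy classes in this repository share the idea but their interfaces may differ.

```python
import torch

def least_confidence_select(model, tokens, mask, k):
    """Return indices of the k sentences with the highest
    uncertainty 1 - P(y* | x) (illustrative sketch)."""
    with torch.no_grad():
        emissions = model.emissions(tokens)
        best_paths = model.crf.decode(emissions, mask=mask)
        # Re-pack the Viterbi paths into a padded tensor so the CRF can score them
        tags = torch.zeros(emissions.shape[:2], dtype=torch.long,
                           device=emissions.device)
        for i, path in enumerate(best_paths):
            tags[i, :len(path)] = torch.tensor(path, device=emissions.device)
        # log P(y* | x) per sentence; exp() gives the best-path probability
        log_p = model.crf(emissions, tags, mask=mask, reduction="none")
        uncertainty = 1.0 - log_p.exp()
    return torch.topk(uncertainty, k).indices.tolist()
```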
Prerequisites
- python 3.6
- pytorch 1.5.1
- numpy 1.19.1
- scikit-learn 0.23.1
- seqeval
- colorama
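For example, the pinned dependencies can be installed with pip (package names are the PyPI ones, e.g. torch for PyTorch):

```
pip install torch==1.5.1 numpy==1.19.1 scikit-learn==0.23.1 seqeval colorama
```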
Datasets
We have experimented with and evaluated the active learning strategies mentioned above on four Chinese datasets and two English datasets. We obtained these datasets from the download link below, then performed some preprocessing on the files, such as splitting extra-long sentences at commas (',').
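As a rough sketch of the comma-splitting step (assuming sentences are parallel token/tag lists; the actual preprocessing scripts may differ):

```python
def split_long_sentence(tokens, tags, max_len=64):
    """Split a sentence longer than max_len into pieces at comma tokens
    (hypothetical helper; sentences without commas stay unsplit)."""
    if len(tokens) <= max_len:
        return [(tokens, tags)]
    pieces, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in (",", "，"):  # ASCII and full-width commas
            pieces.append((tokens[start:i + 1], tags[start:i + 1]))
            start = i + 1
    if start < len(tokens):  # keep the remainder after the last comma
        pieces.append((tokens[start:], tags[start:]))
    return pieces
```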
Dataset Structure
You can find sample files containing parts of the datasets under the datasets directory. In this project, datasets are stored in the following structure.
```
datasets
|--- dataset1
|    |--- train.txt
|    |--- test.txt
|    |--- tags.txt
|--- dataset2
|    |--- train.txt
|    |--- test.txt
|    |--- tags.txt
|--- ...
```
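A small loader for this layout could look as follows. It assumes the common CoNLL-style format of one whitespace-separated token/tag pair per line with blank lines separating sentences; check the sample files under datasets for the exact format used here.

```python
import os

def load_conll_file(path):
    """Read whitespace-separated 'token tag' lines into per-sentence lists."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            else:
                token, tag = line.split()[:2]
                tokens.append(token)
                tags.append(tag)
    if tokens:  # flush a trailing sentence with no final blank line
        sentences.append((tokens, tags))
    return sentences

def load_dataset(root, name):
    """Load train/test splits plus the tag inventory for one dataset."""
    train = load_conll_file(os.path.join(root, name, "train.txt"))
    test = load_conll_file(os.path.join(root, name, "test.txt"))
    with open(os.path.join(root, name, "tags.txt"), encoding="utf-8") as f:
        tag_set = [t.strip() for t in f if t.strip()]
    return train, test, tag_set
```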
Basic Description For Each Dataset
Name | Description | Language |
---|---|---|
People’s Daily | a collection of newswire articles annotated with 3 balanced entity types | Chinese |
Boson_NER | a set of online news annotations published by bosonNLP, containing 6 entity types | Chinese |
Weibo_NER | a collection of short blog posts from the Chinese social media platform Weibo, with 8 extremely unbalanced entity types | Chinese |
OntoNotes-5.0 | a collection of broadcast news articles containing 18 entity types | Chinese |
CONLL2003 | a well-known English dataset of Reuters news stories from August 1996 to August 1997, containing 4 entity types | English |
Ritter | an English dataset of tweets annotated with 10 entity types | English |
Basic Statistics
Column abbreviations: #S = number of sentences, #T = number of tokens, #E = number of entity types, ASL = average sentence length, ASE = average number of entities per sentence, AEL = average entity length, %PT = percentage of tokens belonging to entities (approximately ASE × AEL / ASL); see the paper for the definitions of %AC and %DAC.
Name | #S | #T | #E | ASL | ASE | AEL | %PT | %AC | %DAC |
---|---|---|---|---|---|---|---|---|---|
BosonNLP-train | 27350 | 409830 | 6 | 14.98 | 0.67 | 3.93 | 17.7% | 41.8% | 14.7% |
BosonNLP-test | 6825 | 99616 | 6 | 14.59 | 0.67 | 3.87 | 17.8% | 41.8% | 14.8% |
Weibo_NER-train | 3664 | 85571 | 8 | 23.35 | 0.62 | 2.60 | 6.9% | 33.6% | 14.8% |
Weibo_NER-test | 591 | 13810 | 8 | 23.36 | 0.66 | 2.60 | 7.3% | 36.3% | 17.7% |
OntoNotes5.0_NER-train | 13798 | 362508 | 18 | 26.27 | 1.91 | 3.14 | 22.8% | 72.5% | 48.0% |
OntoNotes5.0_NER-test | 1710 | 44790 | 18 | 26.19 | 1.99 | 3.07 | 23.4% | 75.4% | 51.5% |
PeopleDaily-train | 50658 | 2169879 | 3 | 42.83 | 1.47 | 3.23 | 11.1% | 58.3% | 35.8% |
PeopleDaily-test | 4620 | 172590 | 3 | 37.35 | 1.33 | 3.25 | 11.6% | 54.4% | 29.1% |
CONLL2003-train | 13862 | 203442 | 4 | 14.67 | 1.69 | 1.44 | 16.7% | 79.9% | 44.2% |
CONLL2003-test | 3235 | 51347 | 4 | 15.87 | 1.83 | 1.44 | 16.7% | 80.4% | 48.8% |
Ritter-train | 1955 | 37735 | 10 | 19.30 | 0.62 | 1.65 | 5.3% | 38.1% | 15.3% |
Ritter-test | 438 | 8733 | 10 | 19.93 | 0.60 | 1.62 | 4.9% | 39.2% | 15.5% |
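Several of these columns can be recomputed directly from the files; for instance, using the hypothetical load_conll_file loader sketched above and assuming BIO-style tags (each B- prefix starts an entity):

```python
def basic_statistics(sentences):
    """Compute #S, #T, ASL and ASE for one split (sketch, assumes BIO tags)."""
    num_sentences = len(sentences)
    num_tokens = sum(len(tokens) for tokens, _ in sentences)
    num_entities = sum(tag.startswith("B-")
                       for _, tags in sentences for tag in tags)
    return {
        "#S": num_sentences,
        "#T": num_tokens,
        "ASL": num_tokens / num_sentences,    # average sentence length
        "ASE": num_entities / num_sentences,  # average entities per sentence
    }
```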
Usage
- Modify the configuration file as required (al-ner-demo/config_files/bilstm-crf-al.config):
```
[LOGGER]
logdir_prefix=../logger # please make sure that this directory exists
[WORDEMBEDDING]
method=WORD2VEC
[MODELTRAIN]
method=BiLSTMCRF
[WORD2VEC]
all_word_embedding_path=../embedding/merge_sgns_bigram_char300.pkl
choose_fraction=0.01
courpus_file=../datasets/BosonNLP_NER_6C/
courpus_name=BosonNLP_NER_6C
embedding_dim=300
entity_type=6
max_seq_len=64
tags_file=../datasets/BosonNLP_NER_6C/tags.txt
[BiLSTMCRF]
batch_size=64
device=cuda:0
embedding_dim=300
hidden_dim=200
num_rnn_layers=1
num_epoch=25
learning_rate=1e-3
model_path_prefix=../model/word2vec_bilstm_crf_ltp # please make sure that ../model exists
[ENTITYLEVELF1]
average=micro
digits=2
return_report=False
[ActiveStrategy]
strategy=LTP  # other options: RANDOM, LC, NLC, MNLP, MTP, MTE, LONG, TE
stop_echo=25
query_batch_fraction=0.02
```
With the above configuration, the log files will be saved under the directory below.
AL-NER/logger/BosonNLP_NER_6C/WORD2VEC_BiLSTMCRF_LTP/
- Run the pipeline from the command line:
```
cd al-ner-demo/pipelines/
python -u Word2VecBiLSTMCRFALPipeline.py -c ../config_files/bilstm-crf-al.config -t 1-2 --project 00001
```