AL-NER
In this project, we use pre-trained word2vec embeddings for the word-embedding layer, a BiLSTM as the encoder, and a CRF as the decoder. To evaluate the active learning strategies, we also implement several uncertainty-based sample selection strategies.
Reference
Please cite the paper if this project/paper contributes to your research.
```
@misc{liu2020ltp,
title={LTP: A New Active Learning Strategy for CRF-Based Named Entity Recognition},
author={Mingyi Liu and Zhiying Tu and Tong Zhang and Tonghua Su and Zhongjie Wang},
year={2020},
eprint={2001.02524},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
You can find the latest version of this paper at: https://www.researchgate.net/publication/338476927_LTP_A_New_Active_Learning_Strategy_for_CRF-Based_Named_Entity_Recognition
Note: the arXiv submission is still on hold.
Word Embedding
In this project, we use a 300d word2vec embedding pre-trained on the Chinese Wikipedia corpus for the Chinese datasets, and a 100d GloVe embedding pre-trained on the English Wikipedia corpus for the English datasets. You can get them from the download link below; you then need to convert these files to .pkl files.
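As a rough sketch of this conversion, assuming the embeddings come in the standard text format (one token followed by its vector per line) and that the pipeline expects a pickled {word: vector} dict, which may not match the exact .pkl layout this project uses:

```python
import pickle

import numpy as np

def txt_embedding_to_pkl(txt_path, pkl_path, skip_header=False):
    """Convert a text-format embedding file ('word v1 v2 ...' per line)
    into a {word: np.ndarray} dict stored as a .pkl file (sketch only)."""
    embeddings = {}
    with open(txt_path, encoding="utf-8") as f:
        if skip_header:  # word2vec .txt files start with a "<vocab> <dim>" line
            next(f)
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    with open(pkl_path, "wb") as f:
        pickle.dump(embeddings, f)

# For example:
# txt_embedding_to_pkl("merge_sgns_bigram_char300.txt",
#                      "../embedding/merge_sgns_bigram_char300.pkl",
#                      skip_header=True)
```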
BiLSTM-CRF
BiLSTM-CRF has been widely used for named entity recognition on several typical datasets.
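For reference, here is a minimal PyTorch sketch of this encoder/decoder pairing, using the third-party pytorch-crf package for the CRF layer; the implementation in this repository may differ in detail.

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Minimal BiLSTM encoder + CRF decoder (hypothetical, for illustration)."""

    def __init__(self, vocab_size, num_tags, embedding_dim=300, hidden_dim=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # BiLSTM encoder; hidden_dim is split across the two directions
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2, num_layers=1,
                            bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)
        # CRF decoder over the per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, tokens):
        encoded, _ = self.lstm(self.embedding(tokens))
        return self.hidden2tag(encoded)

    def loss(self, tokens, tags, mask):
        # Negative log-likelihood of the gold tag sequences under the CRF
        return -self.crf(self.emissions(tokens), tags, mask=mask)

    def decode(self, tokens, mask):
        # Viterbi-best tag sequence for each sentence
        return self.crf.decode(self.emissions(tokens), mask=mask)
```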
Sample selection strategies
In this project, the following selection strategies are implemented (a sketch of the uncertainty-scoring pattern they share follows the list).
- RANDOM: RandomStrategy
- LC: LeastConfidenceStrategy
- NLC: NormalizedLeastConfidenceStrategy
- LTP: LeastTokenProbabilityStrategy
- MTP: MinimumTokenProbabilityStrategy
- MTE: MaximumTokenEntropyStrategy
- LONG: LongStrategy
- TE: TokenEntropyStrategy
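As an illustration of that shared pattern, here is a sketch of LC (least confidence), which scores a sentence by 1 - P(y*|x), the probability of its Viterbi-best tag path, and queries the k most uncertain sentences. It reuses the hypothetical BiLSTMCRF sketch above; the strategy classes in this repository share the idea but their interfaces may differ.

```python
import torch

def least_confidence_select(model, tokens, mask, k):
    """Return indices of the k sentences with the highest
    uncertainty 1 - P(y* | x) (illustrative sketch)."""
    with torch.no_grad():
        emissions = model.emissions(tokens)
        best_paths = model.crf.decode(emissions, mask=mask)
        # Re-pack the Viterbi paths into a padded tensor so the CRF can score them
        tags = torch.zeros(emissions.shape[:2], dtype=torch.long,
                           device=emissions.device)
        for i, path in enumerate(best_paths):
            tags[i, :len(path)] = torch.tensor(path, device=emissions.device)
        # log P(y* | x) per sentence; exp() gives the best-path probability
        log_p = model.crf(emissions, tags, mask=mask, reduction="none")
        uncertainty = 1.0 - log_p.exp()
    return torch.topk(uncertainty, k).indices.tolist()
```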
Prerequisites
- python 3.6
- pytorch 1.5.1
- numpy 1.19.1
- scikit-learn 0.23.1
- seqeval
- colorama
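For example, the pinned dependencies can be installed with pip (package names are the PyPI ones, e.g. torch for PyTorch):

```
pip install torch==1.5.1 numpy==1.19.1 scikit-learn==0.23.1 seqeval colorama
```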
Datasets
We have experimented with and evaluated the active learning strategies mentioned above on four Chinese datasets and two English datasets. We obtained these datasets from the download link below, then performed some preprocessing on the files, such as splitting extra-long sentences at commas (',').
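As a rough sketch of the comma-splitting step (assuming sentences are parallel token/tag lists; the actual preprocessing scripts may differ):

```python
def split_long_sentence(tokens, tags, max_len=64):
    """Split a sentence longer than max_len into pieces at comma tokens
    (hypothetical helper; sentences without commas stay unsplit)."""
    if len(tokens) <= max_len:
        return [(tokens, tags)]
    pieces, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in (",", "，"):  # ASCII and full-width commas
            pieces.append((tokens[start:i + 1], tags[start:i + 1]))
            start = i + 1
    if start < len(tokens):  # keep the remainder after the last comma
        pieces.append((tokens[start:], tags[start:]))
    return pieces
```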
Dataset Structure
You can find sample files containing parts of the datasets under the datasets directory. In this project, datasets are stored in the following structure.
```
datasets
|--- dataset1
|    |--- train.txt
|    |--- test.txt
|    |--- tags.txt
|--- dataset2
|    |--- train.txt
|    |--- test.txt
|    |--- tags.txt
|--- ...
```
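A small loader for this layout could look as follows. It assumes the common CoNLL-style format of one whitespace-separated token/tag pair per line with blank lines separating sentences; check the sample files under datasets for the exact format used here.

```python
import os

def load_conll_file(path):
    """Read whitespace-separated 'token tag' lines into per-sentence lists."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            else:
                token, tag = line.split()[:2]
                tokens.append(token)
                tags.append(tag)
    if tokens:  # flush a trailing sentence with no final blank line
        sentences.append((tokens, tags))
    return sentences

def load_dataset(root, name):
    """Load train/test splits plus the tag inventory for one dataset."""
    train = load_conll_file(os.path.join(root, name, "train.txt"))
    test = load_conll_file(os.path.join(root, name, "test.txt"))
    with open(os.path.join(root, name, "tags.txt"), encoding="utf-8") as f:
        tag_set = [t.strip() for t in f if t.strip()]
    return train, test, tag_set
```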
Basic Description For Each Dataset
Name | Description | Language |
---|---|---|
People’s Daily | a collection of newswire articles annotated with 3 balanced entity types | Chinese |
Boson_NER | a set of online news annotations published by bosonNLP, containing 6 entity types | Chinese |
Weibo_NER | a collection of short blog posts from the Chinese social media platform Weibo, with 8 extremely unbalanced entity types | Chinese |
OntoNotes-5.0 | a collection of broadcast news articles containing 18 entity types | Chinese |
CONLL2003 | a well-known English dataset of Reuters news stories from August 1996 to August 1997, containing 4 entity types | English |
Ritter | an English dataset of tweets annotated with 10 entity types | English |
Basic Statistics
Column abbreviations: #S = number of sentences, #T = number of tokens, #E = number of entity types, ASL = average sentence length, ASE = average number of entities per sentence, AEL = average entity length, %PT = percentage of tokens belonging to entities (approximately ASE × AEL / ASL); see the paper for the definitions of %AC and %DAC.
Name | #S | #T | #E | ASL | ASE | AEL | %PT | %AC | %DAC |
---|---|---|---|---|---|---|---|---|---|
BosonNLP-train | 27350 | 409830 | 6 | 14.98 | 0.67 | 3.93 | 17.7% | 41.8% | 14.7% |
BosonNLP-test | 6825 | 99616 | 6 | 14.59 | 0.67 | 3.87 | 17.8% | 41.8% | 14.8% |
Weibo_NER-train | 3664 | 85571 | 8 | 23.35 | 0.62 | 2.60 | 6.9% | 33.6% | 14.8% |
Weibo_NER-test | 591 | 13810 | 8 | 23.36 | 0.66 | 2.60 | 7.3% | 36.3% | 17.7% |
OntoNotes5.0_NER-train | 13798 | 362508 | 18 | 26.27 | 1.91 | 3.14 | 22.8% | 72.5% | 48.0% |
OntoNotes5.0_NER-test | 1710 | 44790 | 18 | 26.19 | 1.99 | 3.07 | 23.4% | 75.4% | 51.5% |
PeopleDaily-train | 50658 | 2169879 | 3 | 42.83 | 1.47 | 3.23 | 11.1% | 58.3% | 35.8% |
PeopleDaily-test | 4620 | 172590 | 3 | 37.35 | 1.33 | 3.25 | 11.6% | 54.4% | 29.1% |
CONLL2003-train | 13862 | 203442 | 4 | 14.67 | 1.69 | 1.44 | 16.7% | 79.9% | 44.2% |
CONLL2003-test | 3235 | 51347 | 4 | 15.87 | 1.83 | 1.44 | 16.7% | 80.4% | 48.8% |
Ritter-train | 1955 | 37735 | 10 | 19.30 | 0.62 | 1.65 | 5.3% | 38.1% | 15.3% |
Ritter-test | 438 | 8733 | 10 | 19.93 | 0.60 | 1.62 | 4.9% | 39.2% | 15.5% |
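Several of these columns can be recomputed directly from the files; for instance, using the hypothetical load_conll_file loader sketched above and assuming BIO-style tags (each B- prefix starts an entity):

```python
def basic_statistics(sentences):
    """Compute #S, #T, ASL and ASE for one split (sketch, assumes BIO tags)."""
    num_sentences = len(sentences)
    num_tokens = sum(len(tokens) for tokens, _ in sentences)
    num_entities = sum(tag.startswith("B-")
                       for _, tags in sentences for tag in tags)
    return {
        "#S": num_sentences,
        "#T": num_tokens,
        "ASL": num_tokens / num_sentences,    # average sentence length
        "ASE": num_entities / num_sentences,  # average entities per sentence
    }
```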
Usage
- Modify the configuration file as required (al-ner-demo/config_files/bilstm-crf-al.config):
```
[LOGGER]
logdir_prefix=../logger # please make sure that this directory exists
[WORDEMBEDDING]
method=WORD2VEC
[MODELTRAIN]
method=BiLSTMCRF
[WORD2VEC]
all_word_embedding_path=../embedding/merge_sgns_bigram_char300.pkl
choose_fraction=0.01
courpus_file=../datasets/BosonNLP_NER_6C/
courpus_name=BosonNLP_NER_6C
embedding_dim=300
entity_type=6
max_seq_len=64
tags_file=../datasets/BosonNLP_NER_6C/tags.txt
[BiLSTMCRF]
batch_size=64
device=cuda:0
embedding_dim=300
hidden_dim=200
num_rnn_layers=1
num_epoch=25
learning_rate=1e-3
model_path_prefix=../model/word2vec_bilstm_crf_ltp # please make sure that ../model exists
[ENTITYLEVELF1]
average=micro
digits=2
return_report=False
[ActiveStrategy]
strategy=LTP  # other options: RANDOM, LC, NLC, MNLP, MTP, MTE, LONG, TE
stop_echo=25
query_batch_fraction=0.02
```
With the above configuration, the log files will be saved under the directory below.
AL-NER/logger/BosonNLP_NER_6C/WORD2VEC_BiLSTMCRF_LTP/
- Run the pipeline from the command line:
```
cd al-ner-demo/pipelines/
python -u Word2VecBiLSTMCRFALPipeline.py -c ../config_files/bilstm-crf-al.config -t 1-2 --project 00001
```