This project provides a neural network based multi-task learning framework for biomedical named entity recognition (BioNER).
The implementation is based on the PyTorch library. Our model collectively trains different biomedical entity types to build a unified model that benefits the training of each single entity type and achieves a significantly better performance compared with the state-of-the-art BioNER systems.
For training, a GPU is strongly recommended for speed. CPU is supported but training could be extremely slow.
The code is based on PyTorch. You can find installation instructions here.
The code is written in Python 3.6. Its dependencies are summarized in the file requirements.txt
. You can install these dependencies like this:
pip3 install -r requirements.txt
To reproduce the results in our paper, you can first download the corpora and the embedding file here, unzip the folder data_bioner_5/
and put it under the main folder ./
. Then the following running script can be used to run the model.
./run_lm-lstm-crf5.sh
We use five biomedical corpora collected by Crichton et al. for biomedical NER. The dataset is publicly available and can be downloaded from here. The details of each dataset are listed below:
Dataset | Entity Type | Dataset Size |
---|---|---|
BC2GM | Gene/Protein | 20,000 sentences |
BC4CHEMD | Chemical | 10,000 abstracts |
BC5CDR | Chemical, Disease | 1,500 articles |
NCBI-disease | Disease | 793 abstracts |
JNLPBA | Gene/Protein, DNA, Cell Type, Cell Line, RNA | 2,404 abstracts |
In our paper, we merge the original training set and development set to be the new training set, as many teams did in the challenge. Some previous work (e.g., Luo et al., Bioinformatics 2017, Lu et al., Journal of cheminformatics 2015 and Leaman and Lu, Bioinformatics 2016) also preprocessed data in this way. If you want to reproduce our results, please follow the same way.
Users may want to use other datasets. We assume the corpus is formatted as same as the CoNLL 2003 NER dataset.
More specifically, empty lines are used as separators between sentences, and the separator between documents is a special line as below.
-DOCSTART- -X- -X- -X- O
Other lines contains words, labels and other fields. Word must be the first field, label must be the last. For example,
-DOCSTART- -X- -X- -X- O
Selegiline S-Chemical
- O
induced O
postural B-Disease
hypotension E-Disease
in O
Parkinson B-Disease
' I-Disease
s I-Disease
disease E-Disease
: O
a O
longitudinal O
study O
on O
the O
effects O
of O
drug O
withdrawal O
. O
We initialize the word embedding matrix with pre-trained word vectors from Pyysalo et al., 2013. These word vectors are trained using the skip-gram model on the PubMed abstracts together with all the full-text articles from PubMed Central (PMC) and a Wikipedia dump. You can download the embedding file here.
train_wc.py
is the script for our multi-task LSTM-CRF model.
The usages of it can be accessed by
python train_wc.py -h
The default running commands are:
python3 train_wc.py --train_file [training file 1] [training file 2] ... [training file N] \
--dev_file [developing file 1] [developing file 2] ... [developing file N] \
--test_file [testing file 1] [testing file 2] ... [testing file N] \
--caseless --fine_tune --emb_file [embedding file] --shrink_embedding --word_dim 200
Users may incorporate an arbitrary number of corpora into the training process. In each epoch, our model randomly selects one dataset i. We use training set i to learn the parameters and developing set i to evaluate the performance. If the current model achieves the best performance for dataset i on the developing set, we will then calculate the precision, recall and F1 on testing set i.
Here we compare our model with recent state-of-the-art models on the five datasets mentioned above. We use F1 score as the evaluation metric.
Model | BC2GM | BC4CHEMD | BC5CDR | NCBI-disease | JNLPBA |
---|---|---|---|---|---|
Dataset Benchmark | - | 88.06 | 86.76 | 82.90 | 72.55 |
Crichton et al. 2016 | 73.17 | 83.02 | 83.90 | 80.37 | 70.09 |
Lample et al. 2016 | 80.51 | 87.74 | 86.92 | 85.80 | 73.48 |
Ma and Hovy 2016 | 78.48 | 86.84 | 86.65 | 82.62 | 72.68 |
Liu et al. 2018 | 80.00 | 88.75 | 86.96 | 83.92 | 72.17 |
Our Model | 80.74 | 89.37 | 88.78 | 86.14 | 73.52 |
Our train_wc.py
provides an option to directly output the annotation results during the training process by the parameter --output_annotation
, i.e.,
python3 train_wc.py --train_file [training file 1] [training file 2] ... [training file N] \
--dev_file [developing file 1] [developing file 2] ... [developing file N] \
--test_file [testing file 1] [testing file 2] ... [testing file N] \
--caseless --fine_tune --emb_file [embedding file] --shrink_embedding --output_annotation --word_dim 200 --gpu 0
If users do not use --output_annotation
, the best performing model during the training process will be saved in ./checkpoint/
.
We have released our pre-trained model. You can download the Arg file and the Model file and put them in ./checkpoint/
.
Using the saved model, seq_wc.py
can be applied to annotate raw text. Its usage can be accessed by command
python seq_wc.py -h
and a running command example is provided below:
python3 seq_wc.py --load_arg checkpoint/cwlm_lstm_crf.json --load_check_point checkpoint/cwlm_lstm_crf.model --input_file test.tsv --output_file annotate/output --gpu 0
The annotation results will be in ./annotate/
.
The input format is similar to CoNLL, but each line is required to contain only one field, token. For example, an input file could be:
The
severe
anemia
(
hemoglobin
1
.
2
g
/
dl
)
appeared
to
be
the
primary
etiologic
factor
.
and the corresponding output is:
The O
severe O
anemia O
( O
hemoglobin B-GENE
1 I-GENE
. I-GENE
2 I-GENE
g I-GENE
/ I-GENE
dl E-GENE
) O
appeared O
to O
be O
the O
primary O
etiologic O
factor O
. O
If you find the implementation useful, please cite the following paper:
@article{wang2018cross,
title={Cross-type biomedical named entity recognition with deep multi-task learning},
author={Wang, Xuan and Zhang, Yu and Ren, Xiang and Zhang, Yuhao and Zitnik, Marinka and Shang, Jingbo and Langlotz, Curtis and Han, Jiawei},
journal={Bioinformatics},
volume={35},
number={10},
pages={1745--1752},
year={2019},
publisher={Oxford University Press}
}