DTranNER is a deep learning-based method for biomedical named entity recognition (NER) that obtains state-of-the-art performance on five biomedical benchmark corpora (BC2GM, BC4CHEMD, BC5CDR-disease, BC5CDR-chemical, and NCBI-Disease). DTranNER is equipped with a deep learning-based label-label transition model that describes the context-dependent relations between neighboring labels. Please refer to our paper, "DTranNER: biomedical named entity recognition with deep learning-based label-label transition model", for more details.
- (29 August 2019) A new version of DTranNER is now available. It has been entirely rewritten in PyTorch and provides significant performance improvements over the scores reported in the submitted manuscript.
To use DTranNER, set up a Python 3 environment with packages such as PyTorch v1.1.0, numpy, and gensim.
Download the specified word embedding file (wikipedia-pubmed-and-PMC-w2v.bin) from here and put it in the w2v directory, which is located under the project root directory.
mkdir w2v
mv wikipedia-pubmed-and-PMC-w2v.bin $PROJECT_ROOT/w2v/
For model training, we recommend using a GPU.
python train.py \
--DTranNER \
--dataset_name ['BC5CDR-disease','BC5CDR-chem','BC2GM','BC4CHEMD',or 'NCBI-disease'] \
--hidden_dim [e.g., 500] \
--pp_hidden_dim [e.g., 500] \
--bilinear_dim [e.g., 500] \
--pp_bilinear_pooling \
--gpu [e.g., 0]
You can change the arguments as you want.
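As a concrete illustration of the template above, the following sketch assembles one training command with the example values filled in. The helper `build_train_command` and its defaults are hypothetical and not part of the repository; only the flag names and dataset names are taken from the template.

```python
import shlex
import sys

# Dataset names exactly as listed in the command template above.
DATASETS = ("BC5CDR-disease", "BC5CDR-chem", "BC2GM", "BC4CHEMD", "NCBI-disease")


def build_train_command(dataset_name, hidden_dim=500, pp_hidden_dim=500,
                        bilinear_dim=500, gpu=0):
    """Return the argv list for one training run (hypothetical helper)."""
    if dataset_name not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset_name!r}")
    return [
        sys.executable, "train.py",
        "--DTranNER",
        "--dataset_name", dataset_name,
        "--hidden_dim", str(hidden_dim),
        "--pp_hidden_dim", str(pp_hidden_dim),
        "--bilinear_dim", str(bilinear_dim),
        "--pp_bilinear_pooling",
        "--gpu", str(gpu),
    ]


if __name__ == "__main__":
    cmd = build_train_command("NCBI-disease")
    # Only print the command here; subprocess.run(cmd, check=True) would launch it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if the command is later passed to `subprocess.run`.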
We initialize the word embedding matrix with the pre-trained word vectors from Pyysalo et al., 2013. These word vectors can be obtained from here. They were trained on PubMed abstracts, PubMed Central (PMC) full texts, and a Wikipedia dump. Recently, contextualized word embeddings have emerged, so we also incorporated ELMo (https://arxiv.org/abs/1802.05365) into our token embedding layer.
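The initialization step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names, the random fallback range for out-of-vocabulary words, and the default path are assumptions. `KeyedVectors.load_word2vec_format` is the gensim call for reading binary word2vec files.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one embedding row per vocabulary word.

    `pretrained` is any mapping from word to a sequence of `dim` floats.
    Words missing from it get small random values (an assumed fallback).
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            row = [float(x) for x in pretrained[word]]
        else:
            row = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
        matrix.append(row)
    return matrix


def load_pretrained(path="w2v/wikipedia-pubmed-and-PMC-w2v.bin"):
    """Load the binary word2vec file with gensim (needs gensim installed)."""
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)


if __name__ == "__main__":
    # Toy vectors stand in for the real file so the sketch runs anywhere.
    toy = {"protein": [0.1, 0.2, 0.3], "cell": [0.4, 0.5, 0.6]}
    rows = build_embedding_matrix(["protein", "kinase", "cell"], toy, dim=3)
    print(len(rows), len(rows[0]))
```

With PyTorch, a matrix like this could then be wrapped with `torch.nn.Embedding.from_pretrained(torch.tensor(rows))`.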
The pre-processed datasets are sourced from https://github.com/cambridgeltl/MTL-Bioinformatics-2016 and
We use biomedical corpora collected by Crichton et al. The dataset is publicly available and can be downloaded from here. In our implementation, the datasets are accessed via $PROJECT_HOME/data/. For details on NER datasets, please refer to A Neural Network Multi-Task Learning Approach to Biomedical Named Entity Recognition (Crichton et al. 2017).
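For orientation, here is a small reader for column-formatted NER files. It assumes one tab-separated token/tag pair per line and a blank line between sentences; that layout is an assumption about the downloaded files, so please check them before relying on it. The function name `read_sentences` is hypothetical.

```python
def read_sentences(lines):
    """Group "token<TAB>tag" lines into (tokens, tags) pairs per sentence.

    A blank line ends the current sentence (assumed file layout).
    """
    sentences, tokens, tags = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            raise ValueError(f"expected token<TAB>tag, got: {line!r}")
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:  # flush the last sentence if the input lacks a trailing blank line
        sentences.append((tokens, tags))
    return sentences


if __name__ == "__main__":
    # Toy lines for illustration only, not taken from any corpus.
    sample = ["drug\tS-Chemical\n", "induced\tO\n", "\n", "fever\tS-Disease\n"]
    for toks, labs in read_sentences(sample):
        print(list(zip(toks, labs)))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `read_sentences(open(path, encoding="utf-8"))` with `path` pointing into the data directory.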
In this study, we use the IOBES tagging scheme. O denotes a non-entity token, B denotes the first token of an entity consisting of multiple tokens, I denotes an inside token of such an entity, E denotes its last token, and S denotes a single-token entity. We are currently conducting experiments with the IOB tagging scheme; the results will be reported soon.
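To make the relation between the two schemes concrete, the sketch below converts an IOB-tagged sentence to IOBES. It assumes tags of the form `B-<label>`, `I-<label>`, and `O` (the IOB2 convention); the function name is hypothetical and this is not necessarily how the repository prepares its data.

```python
def iob_to_iobes(tags):
    """Convert a list of IOB2 tags (B-X, I-X, O) to IOBES tags."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same label.
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label
        if prefix == "B":
            iobes.append(("B-" if continues else "S-") + label)
        elif prefix == "I":
            iobes.append(("I-" if continues else "E-") + label)
        else:
            raise ValueError(f"unexpected tag: {tag!r}")
    return iobes


if __name__ == "__main__":
    iob = ["B-Disease", "I-Disease", "I-Disease", "O", "B-Disease", "O"]
    print(iob_to_iobes(iob))
```

In this toy input, the three-token entity becomes B, I, E and the single-token entity becomes S, matching the definitions given above.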
Here we compare our model with recent state-of-the-art models on the five biomedical corpora mentioned above. We use the F1 score as the evaluation metric. The experimental results are shown in the table below.
Model | BC2GM | BC4CHEMD | BC5CDR-Chemical | BC5CDR-Disease | NCBI-disease |
---|---|---|---|---|---|
Att-BiLSTM-CRF 2017 | - | 91.14 | 92.57 | - | - |
D3NER 2018 | - | - | 93.14 | 84.68 | 84.41 |
Collabonet 2018 | 79.73 | 88.85 | 93.31 | 84.08 | 86.36 |
Wang et al. 2018 | 80.74 | 89.37 | 93.03 | 84.95 | 86.14 |
BioBERT v1.0 | 84.40 | 91.41 | 93.44 | 86.56 | 89.36 |
BioBERT v1.1 | 84.72 | 92.36 | 93.47 | 87.15 | 89.71 |
DTranNER | 84.56 | 91.99 | 94.16 | 87.22 | 88.62 |
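For readers unfamiliar with how F1 is typically computed for NER, here is a minimal sketch of entity-level F1 over IOBES tags. It assumes exact-match scoring, where a predicted entity counts as correct only if its boundaries and label both match the gold annotation; this is the common convention, but the exact evaluation script behind the table is not specified here. The function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, label) spans in an IOBES tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            spans.add((i, i, label))
            start = None
        elif prefix == "B":
            start = (i, label)
        elif prefix == "I":
            if start is not None and start[1] != label:
                start = None  # label changed mid-entity: drop the open span
        elif prefix == "E":
            if start is not None and start[1] == label:
                spans.add((start[0], i, label))
            start = None
    return spans


def f1_score(gold_sentences, pred_sentences):
    """Exact-match entity-level F1 over parallel lists of tag sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [["B-Disease", "E-Disease", "O", "S-Disease"]]
    pred = [["B-Disease", "E-Disease", "O", "O"]]
    print(round(100 * f1_score(gold, pred), 2))
```

The function returns a value between 0 and 1; the table reports scores as percentages, hence the multiplication by 100 in the usage example.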
Please post a GitHub issue or contact skhong831@kaist.ac.kr or skhong0831@gmail.com if you have any questions.