
CollaboNet for Biomedical Named Entity Recognition


CollaboNet: collaboration of deep neural networks for biomedical named entity recognition

This project provides a neural network (bi-LSTM + CRF) approach to biomedical Named Entity Recognition.
Our implementation is based on the TensorFlow library for Python.

  • TITLE : CollaboNet: collaboration of deep neural networks for biomedical named entity recognition
    * Accepted for CIKM 2018 workshop - ACM 12th International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO2018).
  • AUTHOR : Wonjin Yoon1!, Chan Ho So2!, Jinhyuk Lee1 and Jaewoo Kang1*
    • Author details
      1 Department of Computer Science and Engineering, Korea University
      2 Interdisciplinary Graduate Program in Bioinformatics, Korea University
      ! Equal contributor

Requirements

At least one CUDA-compatible GPU device is strongly recommended for running this project's code.
python 2.7
numpy 1.14.2
tensorflow-gpu 1.7.0

License

The code is distributed under the MIT license.
A citable paper is available on a pre-print server [here].

This software includes third party software.
See License-thirdparty.txt for details.

Model

[LEFT] Character-level word embedding using a CNN, and an overview of the Bidirectional LSTM with Conditional Random Field (BiLSTM-CRF).
[RIGHT] Structure of CollaboNet when the Gene model acts as the target model. A rhombus represents a CRF layer. Arrows show the flow of information while the target model is training. Dashed arrows indicate that no information flows while the target model is training.
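The CRF layer of a BiLSTM-CRF tagger is commonly decoded with the Viterbi algorithm, which picks the tag sequence with the highest combined emission and transition score. The following is a minimal pure-Python sketch of that decoding step, not the code used in this repository; the function name `viterbi_decode`, the toy tag set, and all scores are illustrative assumptions.

```python
def viterbi_decode(emissions, transitions):
    """Return (best tag index path, best path score).

    emissions:   per-token tag scores, shape [seq_len][num_tags]
                 (in a BiLSTM-CRF these would come from the BiLSTM output)
    transitions: transitions[i][j] is the score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # best_scores[j]: score of the best path that ends in tag j at the current token
    best_scores = list(emissions[0])
    backpointers = []

    for token_scores in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # previous tag that maximises the path score into tag j
            prev = max(range(num_tags),
                       key=lambda i: best_scores[i] + transitions[i][j])
            pointers.append(prev)
            new_scores.append(best_scores[prev] + transitions[prev][j] + token_scores[j])
        best_scores = new_scores
        backpointers.append(pointers)

    # follow the backpointers from the best final tag to the first token
    last = max(range(num_tags), key=lambda j: best_scores[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_scores[last]


# Toy example (all numbers are made up for illustration).
TAGS = ["O", "B", "I"]
TRANSITIONS = [
    [0.0, 0.0, -10.0],  # from O: moving straight to I is heavily penalised
    [0.0, 0.0, 1.0],    # from B
    [0.0, 0.0, 1.0],    # from I
]
EMISSIONS = [
    [2.0, 1.5, 0.0],    # token 0
    [0.0, 0.5, 2.0],    # token 1
    [1.0, 0.0, -1.0],   # token 2
]

best_path, best_score = viterbi_decode(EMISSIONS, TRANSITIONS)
print([TAGS[t] for t in best_path])  # ['B', 'I', 'O']
```

Choosing the best tag for each token independently would give O, I, O here, which is not a valid BIO sequence; the transition scores steer the decoder to B, I, O instead.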

Data

Train, Test Data

We used datasets collected by Crichton et al., which are available here.
We found that the JNLPBA dataset from Crichton et al. contains incorrectly split sentences, so we regenerated that dataset from the original corpus by Kim et al.

The details of each dataset are shown below:

| Corpora | Entity type | No. of sentences | No. of annotations | Data size |
|---|---|---|---|---|
| NCBI-Disease (Dogan et al., 2014) | Disease | 7,639 | 6,881 | 793 abstracts |
| JNLPBA (Kim et al., 2004) | Gene/Proteins | 22,562 | 35,336 | 2,404 abstracts |
| BC5CDR (Li et al., 2016) | Chemicals | 14,228 | 15,935 | 1,500 articles |
| BC5CDR (Li et al., 2016) | Diseases | 14,228 | 12,852 | 1,500 articles |
| BC4CHEMD (Krallinger et al., 2015a) | Chemicals | 86,679 | 84,310 | 10,000 abstracts |
| BC2GM (Akhondi et al., 2014) | Gene/Proteins | 20,510 | 24,583 | 20,000 sentences |

The datasets are publicly available and can be downloaded by executing download.sh.

Pre-trained Embeddings

We used the pre-trained word embeddings from Pyysalo et al., which are trained on PubMed, PubMed Central (PMC), and Wikipedia text. They are downloaded automatically when you execute download.sh.

Usage

Download Data

bash download.sh

Single Task Model [STM] (6 datasets)

Preparation phase (Phase 0) of CollaboNet

python run.py --ncbi --jnlpba --bc5_chem --bc5_disease --bc4 --bc2 --lr_pump --lr_decay 0.05

You can also refer to stm.sh for detailed usage.

CollaboNet (6 datasets)

You should produce a pre-trained STM model by running the Preparation phase before running CollaboNet.

python run.py --ncbi --jnlpba --bc5_chem --bc5_disease --bc4 --bc2 --lr_pump --lr_decay 0.05 --pretrained STM_MODEL_DIRECTORY_NAME

Replace STM_MODEL_DIRECTORY_NAME with the name of your STM model directory (e.g. 201806210605), which you can find in the ./modelSave folder.
You can also refer to collabo.sh for detailed usage.
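If you would rather not look up the directory name by hand, the lookup can be scripted. The sketch below is a hypothetical helper, not part of this repository: it assumes that the STM model directories under ./modelSave are named with 12-digit timestamps like the 201806210605 example, and it only builds the command shown above from the flags listed there. The function names `latest_stm_model_dir` and `collabonet_command` are made up for illustration.

```python
import os
import re


def latest_stm_model_dir(model_root="./modelSave"):
    """Return the newest timestamp-named subdirectory of model_root, or None."""
    # Assumption: model directories are named with 12 digits, e.g. 201806210605.
    pattern = re.compile(r"^\d{12}$")
    candidates = [
        name for name in os.listdir(model_root)
        if pattern.match(name) and os.path.isdir(os.path.join(model_root, name))
    ]
    # Fixed-width digit strings sort chronologically, so max() is the newest.
    return max(candidates) if candidates else None


def collabonet_command(stm_dir_name):
    """Build the CollaboNet command from the Usage section as an argument list."""
    return [
        "python", "run.py",
        "--ncbi", "--jnlpba", "--bc5_chem", "--bc5_disease", "--bc4", "--bc2",
        "--lr_pump", "--lr_decay", "0.05",
        "--pretrained", stm_dir_name,
    ]


if __name__ == "__main__":
    if os.path.isdir("./modelSave"):
        newest = latest_stm_model_dir("./modelSave")
        if newest is not None:
            # Print the command instead of running it, so it can be checked first.
            print(" ".join(collabonet_command(newest)))
```

The argument list returned by `collabonet_command` can be passed to `subprocess.call` once you have confirmed that the selected directory is the STM run you intend to use.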

Performance

STM

| Model | Metric | NCBI-disease | JNLPBA | BC5CDR-chem | BC5CDR-disease | BC4CHEMD | BC2GM | Average |
|---|---|---|---|---|---|---|---|---|
| Habibi et al. (2017) STM | F1 Score | 84.44 | 77.25 | 90.63 | 83.49 | 86.62 | 77.82 | 83.38 |
| Wang et al. (2018) STM | F1 Score | 83.92 | 72.17 | *89.85 | *82.68 | 88.75 | 80.00 | 82.90 |
| Our STM | F1 Score | 84.69 | 77.39 | 92.74 | 82.61 | 88.40 | 79.27 | 84.03 |
  • Scores in the asterisked (*) cells are obtained in the experiments that we conducted; these scores are not reported in the original papers.
  • The best scores from these experiments are in bold.

CollaboNet

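| Model | Metric | NCBI-disease | JNLPBA | BC5CDR-chem | BC5CDR-disease | BC4CHEMD | BC2GM | Average |
|---|---|---|---|---|---|---|---|---|
| Wang et al. (2018) MTM | F1 Score | 86.14 | 73.52 | *91.29 | *83.33 | 89.37 | 80.74 | 84.07 |
| Our CollaboNet | F1 Score | 86.36 | 78.58 | 93.31 | 84.08 | 88.85 | 79.73 | 85.15 |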
  • Scores in the asterisked (*) cells are obtained in the experiments that we conducted; these scores are not reported in the original papers.
  • The best scores from these experiments are in bold.
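For readers unfamiliar with how an F1 score is typically computed for NER, here is a minimal sketch. It assumes BIO-style tags and exact-match, entity-level scoring, which is a common convention but is not confirmed by this README as the evaluation used for the tables above; the function names and the example tags are illustrative only.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence; end is exclusive."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # Simplification: an I- tag with no preceding B- tag is ignored.
    return entities


def f1_score(gold_tags, pred_tags):
    """Exact-match entity-level F1 for one tag sequence."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / float(len(pred)) if pred else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-Gene", "I-Gene", "O", "B-Gene", "O"]
pred = ["B-Gene", "I-Gene", "O", "O", "B-Gene"]
print(f1_score(gold, pred))  # 0.5 (one of two predicted entities matches one of two gold entities)
```

For a whole corpus, the counts of correct, predicted, and gold entities would normally be summed over all sentences before computing precision and recall, rather than averaging per-sentence scores.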