BioNER

This repository contains the code for BioNER, an LSTM-based model designed for biomedical named entity recognition (NER).

Download

We provide models trained on the following datasets:

Dataset	Mirror (Siasky)	Mirror (Mega)
MedMentions full	Download Model	Download Model
MedMentions ST21pv	Download Model	Download Model
JNLPBA	Download Model	Download Model

In addition, the word embeddings trained with fastText on PubMed Baseline 2021 are provided for the following n-gram ranges:

n-gram range	Mirror (Siasky)	Mirror (Mega)	Mirror (Storj)
3-4	Download	Download	Download
3-6	Download	Download	Download

Installation

Install the dependencies.

pip install -r requirements.txt

Deterministic behaviour is enabled by default. When using CUDA, you may therefore need to set the environment variable CUBLAS_WORKSPACE_CONFIG to prevent a RuntimeError.

export CUBLAS_WORKSPACE_CONFIG=:4096:8
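
If you launch BioNER from a Python entry point instead of a shell, the same setting can be applied in code. The snippet below is a minimal sketch, not part of the BioNER code; it assumes it runs at the very top of the script, before any CUDA work starts.

```python
import os

# Keep a value the user has already exported; otherwise fall back to the
# value recommended above. Assumption: this runs before the first CUDA call,
# so it should sit at the top of the entry script.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

print("CUBLAS_WORKSPACE_CONFIG =", os.environ["CUBLAS_WORKSPACE_CONFIG"])
```

Using setdefault means an explicit export in the shell still takes precedence.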

Usage

Dataset Preprocessing

BioNER expects a dataset in the CoNLL-2003 format. We used the tool bconv for preprocessing the MedMentions dataset.
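
To illustrate the expected input, here is a small sketch of a CoNLL-2003 style reader. It assumes one token per line with whitespace-separated columns, the token in the first column, the entity tag in the last column, and blank lines between sentences. The function name read_conll and the sample data are made up for illustration and are not taken from the BioNER code.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, entity tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-2003 style lines (hypothetical helper)."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: token is the first column, entity tag the last one.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


# Illustrative sample only; the POS and chunk columns are placeholders.
sample = """-DOCSTART- -X- -X- O

IL-2 NN B-NP B-DNA
gene NN I-NP I-DNA
expression NN I-NP O

NF-kappa NN B-NP B-protein
B NN I-NP I-protein
""".splitlines()

sentences = list(read_conll(sample))
for sentence in sentences:
    print(sentence)
```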

Training

You can either use the provided Makefile to train the BioNER model or execute train_bioner.py directly. If you use the Makefile, fill in its empty fields before the first run.

make train-bioner ngrams=3-4

Annotation

You can annotate a CoNLL-2003 dataset in the following way:

# EMBEDDINGS_FILE: path to the word embeddings file
# DATASET_FILE:    path to the CoNLL-2003 dataset
# OUTPUT_FILE:     path to the output file for storing the annotated dataset
# MODEL_FILE:      path to the trained BioNER model
python annotate_dataset.py \
  --embeddings EMBEDDINGS_FILE \
  --dataset DATASET_FILE \
  --outputFile OUTPUT_FILE \
  --model MODEL_FILE
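
If you want to call the annotation script from Python, for example to annotate several datasets in a loop, the command can be assembled as an argument list. This is a sketch under assumptions: the helper names and all file paths are hypothetical, the script is assumed to be run from the repository root, and --enableExportCoNLL (described in this section) is assumed to be a switch that takes no value.

```python
import subprocess
import sys
from typing import List


def build_annotate_command(embeddings: str, dataset: str, output_file: str,
                           model: str, export_conll: bool = False) -> List[str]:
    """Assemble the annotate_dataset.py call (hypothetical helper)."""
    command = [
        sys.executable, "annotate_dataset.py",
        "--embeddings", embeddings,
        "--dataset", dataset,
        "--outputFile", output_file,
        "--model", model,
    ]
    if export_conll:
        # Assumption: the flag is a plain switch without an argument.
        command.append("--enableExportCoNLL")
    return command


def run_annotation(command: List[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


# Placeholder paths; replace them with your own files.
command = build_annotate_command(
    embeddings="embeddings.bin",
    dataset="test.conll",
    output_file="out/annotated.txt",
    model="bioner.model",
    export_conll=True,
)
print(" ".join(command))
# run_annotation(command)  # uncomment inside the repository root
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.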

Furthermore, you can add the flag --enableExportCoNLL to export an additional file to the same parent folder as the outputFile. This file can be used for evaluation with the original conlleval.pl Perl script (source).
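
For orientation, conlleval.pl reads one token per line with whitespace-separated columns, where the last two columns are the gold tag and the predicted tag, and sentences are separated by blank lines. The sketch below produces that layout from in-memory predictions. Whether the file exported by BioNER matches it column for column is an assumption, and the helper name and sample data are made up for illustration.

```python
from typing import List, Tuple

# One sentence is a list of (token, gold tag, predicted tag) triples.
TaggedSentence = List[Tuple[str, str, str]]


def to_conlleval_lines(sentences: List[TaggedSentence]) -> List[str]:
    """Format predictions in the layout conlleval.pl reads (hypothetical helper)."""
    lines: List[str] = []
    for sentence in sentences:
        for token, gold, predicted in sentence:
            # Gold tag first, predicted tag last, as conlleval.pl expects.
            lines.append(f"{token} {gold} {predicted}")
        lines.append("")  # blank line marks the sentence boundary
    return lines


# Illustrative sample only.
example = [
    [("IL-2", "B-DNA", "B-DNA"), ("gene", "I-DNA", "O")],
    [("NF-kappa", "B-protein", "B-protein")],
]

conlleval_lines = to_conlleval_lines(example)
print("\n".join(conlleval_lines))
```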