We recently published HunFlair, a reimplementation of HUNER inside the Flair framework. By using language models, HunFlair considerably outperforms HUNER. In addition, as part of Flair, HunFlair is easy to install and has no dependency on Docker. We recommend that all HUNER users migrate to HunFlair.
HUNER is a state-of-the-art NER model for biomedical entities. It comes with models for genes/proteins, chemicals, diseases, species and cell lines.
The code is based on the great LSTM-CRF NER tagger implementation `glample/tagger` by Guillaume Lample.
Section | Description
---|---
Installation | How to install HUNER
Usage | How to use HUNER
Models | Available pretrained models
Corpora | The HUNER Corpora
- Install Docker.
- Clone this repository to `$dir`.
- Download the pretrained model you want to use from here, place it into `$dir/models/$model_name` and untar it using `tar xzf $model_name`.
To tokenize, sentence-split and tag a file INPUT.TXT:

- Start the HUNER server from `$dir` using `./start_server $model_name`. The model must reside in the directory `$dir/models/$model_name`.
- Tag text with `python client.py INPUT.TXT OUTPUT.CONLL --name $model_name`.

The output will then be written to OUTPUT.CONLL in the CoNLL2003 format.
The options for `client.py` are:

- `--assume_tokenized`: The input is already pre-tokenized and the tokens are separated by whitespace.
- `--assume_sentence_splitted`: The input is already split into sentences and each line of the input contains one sentence.
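The CoNLL2003 output can be consumed with a few lines of Python. Here is a minimal sketch, assuming one whitespace-separated `token tag` pair per line, blank lines between sentences, and IOB tags; the helper `read_entities` and the tag names `B-Gene`/`B-Disease` are illustrative, not part of HUNER (the exact tag names depend on the chosen model):

```python
# Collect entity mentions from CoNLL2003-style tagger output.
# Assumes whitespace-separated columns with the IOB tag in the last one.
def read_entities(conll_text):
    entities, current = [], []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line:                      # sentence boundary
            if current:
                entities.append(" ".join(current))
                current = []
            continue
        token, tag = line.split()[0], line.split()[-1]
        if tag.startswith("B-"):          # a new entity starts
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)         # entity continues
        else:                             # "O" ends any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

sample = "BRCA1 B-Gene\nis O\na O\ntumor B-Disease\nsuppressor I-Disease"
print(read_entities(sample))  # ['BRCA1', 'tumor suppressor']
```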
The steps to fine-tune a base model `$base_model` (e.g. `gene_all`) on a new corpus `$corpus` are:

- Copy the chosen base model to a new directory, because the weight files will be updated during fine-tuning: `cp $dir/models/$base_model $dir/models/$fine_tuned_model`
- Convert your corpus to CoNLL format and split it into `train`, `dev` and `test` portions. If you don't want to use either dev or test data, you can just provide the training data as `dev` or `test`. Note, however, that without dev data results will probably suffer, because early stopping can't be performed.
- Fine-tune the model: `./train.sh $fine_tuned_model $corpus_train $corpus_dev $corpus_test`

After successful training, `$fine_tuned_model` will contain the fine-tuned model and can be used exactly like the models provided by us.
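The corpus-splitting step can be sketched in Python. This is a hedged sketch, not part of HUNER: the helper `split_conll` and the 80/10/10 ratio are illustrative assumptions, and HUNER only requires that you end up with three CoNLL files:

```python
# Split a CoNLL-formatted corpus into train/dev/test portions at
# sentence boundaries (sentences are separated by blank lines).
def split_conll(conll_text, train_frac=0.8, dev_frac=0.1):
    sentences = [s for s in conll_text.strip().split("\n\n") if s.strip()]
    n = len(sentences)
    a = int(n * train_frac)
    b = int(n * (train_frac + dev_frac))
    join = lambda part: "\n\n".join(part) + "\n"
    return join(sentences[:a]), join(sentences[a:b]), join(sentences[b:])
```

Splitting at blank lines keeps sentences intact, as the CoNLL format requires; for corpora with several sentences per abstract, splitting at document boundaries instead avoids leakage between the portions.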
To train a model from scratch without initializing it from a base model, proceed as follows:

- Convert your corpus to CoNLL format and split it into `train`, `dev` and `test` portions. If you don't want to use either dev or test data, you can just provide the training data as `dev` or `test`. Note, however, that without dev data results will probably suffer, because early stopping can't be performed.
- Train the model: `./train_no_finetune.sh $corpus_train $corpus_dev $corpus_test`

After successful training, the model can be found in a newly created directory in `models/`. The directory name reflects the chosen hyper-parameters and usually reads like `tag_scheme=iob,lower=False,zeros=False,char_dim=25...`.
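Because the output directory name depends on the hyper-parameters, a script can locate the freshly trained model by modification time. A small sketch; the helper `newest_model_dir` is an assumption for illustration, not part of HUNER:

```python
# Return the most recently modified subdirectory of models/, which is
# where a freshly trained model ends up.
from pathlib import Path

def newest_model_dir(models_root="models"):
    subdirs = [p for p in Path(models_root).iterdir() if p.is_dir()]
    if not subdirs:
        raise FileNotFoundError(f"no model directories under {models_root}")
    return max(subdirs, key=lambda p: p.stat().st_mtime)
```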
Model | Test sets P / R / F1 (%) | CRAFT P / R / F1 (%)
---|---|---
cellline_all | 70.40 / 65.37 / 67.76 | -
chemical_all | 83.34 / 80.26 / 81.71 | 53.56 / 35.85 / 42.95
disease_all | 75.01 / 77.71 / 76.20 | -
gene_all | 72.33 / 76.28 / 73.97 | 59.67 / 65.98 / 62.66
species_all | 77.88 / 74.86 / 73.33 | 98.51 / 73.83 / 84.40
For details and instructions on the HUNER corpora, please refer to https://github.com/hu-ner/huner/tree/master/ner_scripts and the corresponding readme.
Please use the following BibTeX entry:
@article{weber2019huner,
title={HUNER: Improving Biomedical NER with Pretraining},
author={Weber, Leon and M{\"u}nchmeyer, Jannes and Rockt{\"a}schel, Tim and Habibi, Maryam and Leser, Ulf},
journal={Bioinformatics},
year={2019}
}