/biobert

Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Primary LanguagePythonOtherNOASSERTION

BioBERT

This repository provides the code for fine-tuning BioBERT, a biomedical language representation model designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. Please refer to our paper BioBERT: a pre-trained biomedical language representation model for biomedical text mining for more details. This project is done by DMIS-Lab.

Download

We provide five versions of pre-trained weights. Pre-training was based on the original BERT code provided by Google, and training details are described in our paper. Currently available versions of pre-trained weights are as follows:

Alternately, you can download pre-trained weights from here

Installation

Sections below describe the installation and the fine-tuning process of BioBERT based on Tensorflow 1 (python version <= 3.7). For PyTorch version of BioBERT, you can check out this repository. If you are not familiar with coding and just want to recognize biomedical entities in your text using BioBERT, please use this tool which uses BioBERT for multi-type NER and normalization.

To fine-tune BioBERT, you need to download the pre-trained weights of BioBERT. After downloading the pre-trained weights, use requirements.txt to install BioBERT as follows:

$ git clone https://github.com/dmis-lab/biobert.git
$ cd biobert; pip install -r requirements.txt

Note that this repository is based on the BERT repository by Google. All the fine-tuning experiments were conducted on a single TITAN Xp GPU machine which has 12GB of RAM. You might want to install java to use the official evaluation script of BioASQ. See requirements.txt for other details.

Quick Links

Link Detail
BioBERT-PyTorch PyTorch-based BioBERT implementation
BERN Web-based biomedical NER + normalization using BioBERT
covidAsk BioBERT based real-time question answering model for COVID-19
7th BioASQ Code for the seventh BioASQ challenge winning model (factoid/yesno/list)
Paper Paper link with BibTeX (Bioinformatics)

FAQs

Datasets

We provide a pre-processed version of benchmark datasets for each task as follows:

You can simply run download.sh to download all the datasets at once.

$ ./download.sh

This will download the datasets under the folder datasets. Due to the copyright issue of other datasets, we provide links of those datasets instead: 2010 i2b2/VA, ChemProt.

Fine-tuning BioBERT

After downloading one of the pre-trained weights, unpack it to any directory you want, and we will denote this as $BIOBERT_DIR. For instance, when using BioBERT-Base v1.1 (+ PubMed 1M), set BIOBERT_DIR environment variable as:

$ export BIOBERT_DIR=./biobert_v1.1_pubmed
$ echo $BIOBERT_DIR
>>> ./biobert_v1.1_pubmed

Named Entity Recognition (NER)

Let $NER_DIR indicate a folder for a single NER dataset which contains train_dev.tsv, train.tsv, devel.tsv and test.tsv. Also, set $OUTPUT_DIR as a directory for NER outputs (trained models, test predictions, etc). For example, when fine-tuning on the NCBI disease corpus,

$ export NER_DIR=./datasets/NER/NCBI-disease
$ export OUTPUT_DIR=./ner_outputs

Following command runs fine-tuning code on NER with default arguments.

$ mkdir -p $OUTPUT_DIR
$ python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR

You can change the arguments as you want. Once you have trained your model, you can use it in inference mode by using --do_train=false --do_predict=true for evaluating test.tsv. The token-level evaluation result will be printed as stdout format. For example, the result for NCBI-disease dataset will be like this:

INFO:tensorflow:***** token-level evaluation results *****
INFO:tensorflow:  eval_f = 0.8972311
INFO:tensorflow:  eval_precision = 0.88150835
INFO:tensorflow:  eval_recall = 0.9136615
INFO:tensorflow:  global_step = 2571
INFO:tensorflow:  loss = 28.247158

(tips : You should go up a few lines to find the result. It comes before INFO:tensorflow:**** Trainable Variables **** )

Note that this result is the token-level evaluation measure while the official evaluation should use the entity-level evaluation measure. The results of python run_ner.py will be recorded as two files: token_test.txt and label_test.txt in $OUTPUT_DIR. Use ./biocodes/ner_detokenize.py to obtain word level prediction file.

$ python biocodes/ner_detokenize.py --token_test_path=$OUTPUT_DIR/token_test.txt --label_test_path=$OUTPUT_DIR/label_test.txt --answer_path=$NER_DIR/test.tsv --output_dir=$OUTPUT_DIR

This will generate NER_result_conll.txt in $OUTPUT_DIR. Use ./biocodes/conlleval.pl for entity-level exact match evaluation results.

$ perl biocodes/conlleval.pl < $OUTPUT_DIR/NER_result_conll.txt

The entity-level results for the NCBI disease corpus will be like:

processed 24497 tokens with 960 phrases; found: 983 phrases; correct: 852.
accuracy:  98.49%; precision:  86.67%; recall:  88.75%; FB1:  87.70
             MISC: precision:  86.67%; recall:  88.75%; FB1:  87.70  983

Note that this is a sample run of an NER model. The performance of NER models usually converges at more than 50 epochs (learning rate = 1e-5 is recommended).

Relation Extraction (RE)

Let $RE_DIR indicate a folder for a single RE dataset, $TASK_NAME denote the name of task (two possible options: {gad, euadr}), and $OUTPUT_DIR denote a directory for RE outputs:

$ export RE_DIR=./datasets/RE/GAD/1
$ export TASK_NAME=gad
$ export OUTPUT_DIR=./re_outputs_1

Following command runs fine-tuning code on RE with default arguments.

$ python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=true --do_predict=true --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR

The predictions will be saved into a file called test_results.tsv in the $OUTPUT_DIR. Use ./biocodes/re_eval.py for the evaluation. Note that the CHEMPROT dataset is a multi-class classification dataset and to evaluate the CHEMPROT result, you should run re_eval.py with additional --task=chemprot flag.

$ python ./biocodes/re_eval.py --output_path=$OUTPUT_DIR/test_results.tsv --answer_path=$RE_DIR/test.tsv

The result for GAD dataset will be like this:

f1 score    : 83.74%
recall      : 90.75%
precision   : 77.74%
specificity : 71.15%

Please be aware that you have to change $OUTPUT_DIR to train/test a new model. For instance, as most RE datasets are in 10-fold, you have to make a different output directory to train/test a model for a different fold (e.g., $ export OUTPUT_DIR=./re_outputs_2).

Question Answering (QA)

To use the BioASQ dataset, you need to register in the BioASQ website which authorizes the use of the dataset. Please unpack the pre-processed BioASQ dataset provided above to a directory $QA_DIR. For example, with $OUTPUT_DIR for QA outputs, set as:

$ export QA_DIR=./datasets/QA/BioASQ
$ export OUTPUT_DIR=./qa_outputs

Files named as BioASQ-*.json are used for training and testing the model which are the pre-processed format for BioBERT. Note that we pre-trained our model on SQuAD dataset to get state-of-the-art performance (see here to get BioBERT pre-trained on SQuAD), and you might have to change $BIOBERT_DIR accordingly. Following command runs fine-tuning code on QA with default arguments.

$ python run_qa.py --do_train=True --do_predict=True --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --max_seq_length=384 --train_batch_size=12 --learning_rate=5e-6 --doc_stride=128 --num_train_epochs=5.0 --do_lower_case=False --train_file=$QA_DIR/BioASQ-train-factoid-4b.json --predict_file=$QA_DIR/BioASQ-test-factoid-4b-1.json --output_dir=$OUTPUT_DIR

The predictions will be saved into a file called predictions.json and nbest_predictions.json in $OUTPUT_DIR. Run ./biocodes/transform_nbset2bioasqform.py to convert nbest_predictions.json to the BioASQ JSON format, which will be used for the official evaluation.

$ python ./biocodes/transform_nbset2bioasqform.py --nbest_path=$OUTPUT_DIR/nbest_predictions.json --output_path=$OUTPUT_DIR

This will generate BioASQform_BioASQ-answer.json in $OUTPUT_DIR. Clone evaluation code from BioASQ github and run evaluation code on Evaluation-Measures directory. Please note that you should always put 5 as parameter for -e.

$ git clone https://github.com/BioASQ/Evaluation-Measures.git
$ cd Evaluation-Measures
$ java -Xmx10G -cp $CLASSPATH:./flat/BioASQEvaluation/dist/BioASQEvaluation.jar evaluation.EvaluatorTask1b -phaseB -e 5 ../$QA_DIR/4B1_golden.json ../$OUTPUT_DIR/BioASQform_BioASQ-answer.json

As our model is only on factoid questions, the result will be like,

0.0 0.3076923076923077 0.5384615384615384 0.394017094017094 0.0 0.0 0.0 0.0 0.0 0.0

where the second, third and fourth numbers will be SAcc, LAcc, and MRR of factoid questions respectively. For list and yes/no type questions, please refer to our repository for BioBERT at the 7th BioASQ Challenge.

License and Disclaimer

Please see the LICENSE file for details. Downloading data indicates your acceptance of our disclaimer.

Citation

@article{lee2020biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal={Bioinformatics},
  volume={36},
  number={4},
  pages={1234--1240},
  year={2020},
  publisher={Oxford University Press}
}

Contact Information

For help or issues using BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee (lee.jnhk (at) gmail.com), or Wonjin Yoon (wonjin.info (at) gmail.com) for communication related to BioBERT.