bio-ner

Biomedical Named Entity Recognition and Normalization of the Diseases, Chemicals and Genetic entity classes through the use of state-of-the-art models.

Primary language: Jupyter Notebook. License: Apache-2.0.

bioNLP System for BioNER and BioNEN


Biomedical Named Entity Recognition and Normalization of the Diseases, Drugs, Genes and Proteins entity classes using state-of-the-art techniques. Entity recognition is based on the BioBERT language model; the normalization step is performed through an inverted-index search in a Solr database.
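As a rough sketch of what the recognition step produces, a BioBERT token classifier emits one BIO tag per token, which is then grouped into entity spans. The decoder below is a minimal, hypothetical illustration (the actual system relies on transformers and spaCy), with made-up example tokens and labels:

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any entity in progress.
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the current entity.
            current.append(tok)
        else:
            # "O" tag or inconsistent continuation: close the entity.
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


spans = decode_bio(
    ["Aspirin", "treats", "heart", "disease"],
    ["B-CHEMICAL", "O", "B-DISEASE", "I-DISEASE"],
)
# spans == [("Aspirin", "CHEMICAL"), ("heart disease", "DISEASE")]
```

The grouped spans are what the normalization step then looks up in Solr.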

System Set-up

The bionlp package is mainly intended to be used as part of the webpage or the annotation of CORD-19. In the dockerized versions these requirements are already satisfied. To use it separately, the following dependencies must be installed:

  • transformers>=4.5.0
  • spacy>=3
  • pysolr~=3.9.0
  • torch

The bionlp package can be found in bio-nlp/bionlp.
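For standalone use, the dependencies above can be collected in a requirements.txt (torch is left unpinned here, matching the list above):

```
transformers>=4.5.0
spacy>=3
pysolr~=3.9.0
torch
```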

Solr Database for Normalization

A Solr database must be set up and populated on localhost before the following deployments, since the normalization step relies on it. Details about this configuration can be found in bio-nlp/Solr.
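As a rough illustration of the normalization lookup, a Solr query can be built from the surface form of a recognized entity and run with pysolr. This is a minimal sketch: the field name `name` and the per-class core layout are assumptions, not the repository's actual schema.

```python
def build_normalization_query(term: str, field: str = "name") -> str:
    """Build a Solr query string matching an entity's surface form.

    Solr special characters are escaped so the term is matched
    literally inside the quoted phrase.
    """
    specials = r'+-&|!(){}[]^"~*?:\/'
    escaped = "".join("\\" + c if c in specials else c for c in term)
    return f'{field}:"{escaped}"'


query = build_normalization_query("heart disease")
# query == 'name:"heart disease"'
# With pysolr this would then be run against a core, e.g. (core name assumed):
# results = pysolr.Solr("http://localhost:8983/solr/diseases").search(query)
```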

Web Page

Available at: https://librairy.github.io/bio-ner/.

The webpage makes the system easy to use: paste the text to be processed and click the Analyze button. The text is sent through an AJAX call to the system, which returns the annotated and normalized data in the following views:

Results Annotation

Annotated results will be represented in coloured boxes where each box represents one entity class.

Results Normalized

Normalized results will appear in a table for each of the entity classes. Each found term is retrieved along with its IDs stored in the Solr database. An extra table will appear if COVID-related terms are found in the processed text, showing drug-target evidence or related proteins.

Results in JSON

To ease later use of the retrieved information, a JSON text box is also provided.
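The JSON view makes the results easy to consume programmatically. The snippet below parses a hypothetical response; the field names (`entities`, `text`, `class`, `ids`) are illustrative assumptions, not the service's documented schema:

```python
import json

# Hypothetical example of the returned JSON; the actual field names
# used by the service may differ.
response = json.loads("""
{
  "entities": [
    {"text": "heart disease", "class": "DISEASE", "ids": ["MESH:D006331"]}
  ]
}
""")

# Filter the normalized entities by class.
diseases = [e["text"] for e in response["entities"] if e["class"] == "DISEASE"]
```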

Web deployment

This web platform can be easily deployed thanks to its dockerization. The Docker image can be found on Docker Hub: https://hub.docker.com/r/alvaroalon2/webapp_bionlp. The models are included within the image. To use your own local Solr database instead of the provided online endpoint 'http://librairy.linkeddata.es/solr/', simply omit the SOLR_URL environment variable from docker run.

GPU Support

The NVIDIA Container Toolkit is needed for GPU support inside Docker containers with NVIDIA GPUs. The deployment can be performed as follows:

  1. docker pull alvaroalon2/webapp_bionlp:gpu
  2. docker run --name webapp -it --gpus all --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/webapp_bionlp:gpu

CPU Support

If a GPU is not available, deployment can also be done on CPU. In that case, it is recommended to use the CPU dockerized version instead of the GPU one:

  1. docker pull alvaroalon2/webapp_bionlp:cpu
  2. docker run --name webapp -it --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/webapp_bionlp:cpu

CORD-19 Annotation

The proposed system has been used in a real-world use case: processing the CORD-19 corpus, which contains more than 300K coronavirus-related articles. For this purpose, the corpus was first pre-processed to split it into paragraphs, then loaded into a Solr database using https://github.com/librairy/cord-19. To ease the use of this annotation, the dockerized version is recommended. The Docker Hub repository for this image can be found at: https://hub.docker.com/r/alvaroalon2/bionlp_cord19_annotation. The models are included within the image. To use your own local Solr database instead of the provided online endpoint 'http://librairy.linkeddata.es/solr/', simply omit the SOLR_URL environment variable from docker run.
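The paragraph-level pre-processing can be sketched roughly as follows. This is a minimal illustration under the assumption that paragraphs are separated by blank lines; the actual librairy/cord-19 loader handles the full CORD-19 JSON schema:

```python
def split_into_paragraph_docs(article_id: str, text: str):
    """Split an article body into per-paragraph Solr documents.

    Each paragraph becomes its own document so annotations can be
    stored and queried at paragraph granularity. The id scheme and
    field names here are illustrative, not the real loader's schema.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"id": f"{article_id}_{i}", "article_id": article_id, "text": p}
        for i, p in enumerate(paragraphs)
    ]


docs = split_into_paragraph_docs("a1", "First paragraph.\n\nSecond paragraph.")
# docs[0] == {"id": "a1_0", "article_id": "a1", "text": "First paragraph."}
```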

GPU Support

The NVIDIA Container Toolkit is needed for GPU support inside Docker containers with NVIDIA GPUs. The steps for running the container and initializing its annotation and normalization are as follows:

  1. docker pull alvaroalon2/bionlp_cord19_annotation:gpu
  2. docker run --name annotation -it --gpus all --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/bionlp_cord19_annotation:gpu

CPU Support

If a GPU is not available, deployment can also be done on CPU, although annotation will be substantially slower. In that case, it is recommended to use the CPU dockerized version instead of the GPU one:

  1. docker pull alvaroalon2/bionlp_cord19_annotation:cpu
  2. docker run --name annotation -it --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/bionlp_cord19_annotation:cpu

Models

One model was proposed for each of the entity classes: Diseases, Chemicals and Genetic. The final system is therefore composed of three models, each of which carries out the annotation of its own entity class. The system automatically checks whether the models have already been stored in their proper folders; if a model is missing, a cached version is automatically downloaded from the Hugging Face repository where the proposed models were uploaded.
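The local-folder-then-hub fallback can be sketched as below. The folder layout and the hub repository ID are hypothetical placeholders; what is real is that transformers' `from_pretrained` accepts either a local path or a Hugging Face hub ID:

```python
import os

def resolve_model(entity_class: str, models_dir: str = "models") -> str:
    """Return a local model path if present, else a Hugging Face repo ID.

    The returned string can be passed directly to transformers'
    AutoModelForTokenClassification.from_pretrained(). The folder
    layout and repo naming scheme here are illustrative only.
    """
    local = os.path.join(models_dir, entity_class)
    if os.path.isdir(local):
        return local  # use the previously stored model
    return f"example-org/biobert_{entity_class}_ner"  # hypothetical hub ID
```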

Further details are described in bio-nlp/models. The models can also be leveraged in other systems if desired.

Fine-tuning

The fine-tuning process was done in Google Colab using a TPU. For that purpose, the Fine_tuning.ipynb Jupyter Notebook is provided, which makes use of the scripts found in bio-nlp/fine-tuning. These scripts were partially adapted from those originally proposed in the BioBERT repository in order to allow TPU execution and the use of a newer version of huggingface-transformers.
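One detail such token-classification fine-tuning must handle is aligning word-level BIO labels to subword tokens. The sketch below is a minimal, generic illustration: `word_ids` mimics the mapping a transformers fast tokenizer returns, and -100 is the label index PyTorch's cross-entropy loss ignores by default.

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    word_ids: for each subword token, the index of the word it came
    from, or None for special tokens ([CLS], [SEP], padding).
    Only the first subword of each word keeps the label; the rest
    (and special tokens) get ignore_index so they don't affect the loss.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)   # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a word
        else:
            aligned.append(ignore_index)   # later subword of the same word
        previous = wid
    return aligned


# Two words labeled [1, 0]; the first word splits into two subwords.
labels = align_labels([1, 0], [None, 0, 0, 1, None])
# labels == [-100, 1, -100, 0, -100]
```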

Embedding visualization

Details about visualization can be found on bio-nlp/Embeddings along with an example.