Santosh Tirunagari, Melissa Harrison
Biomedical Named Entity Recognition (NER) is a significant challenge in biomedical information processing because of extensive lexical variation and the ambiguity of out-of-context terms. Recent models such as BERT, GPT, and other large language models (LLMs) have improved performance on bioNER benchmarks. However, these models often demand substantial computational resources to run in production. We introduce the quantised_epmca_bioformer-8L (QEB8L) model, trained on the Europe PMC fully annotated corpus for genes/proteins, diseases, chemicals, and organisms. The QEB8L model runs on the ONNX runtime and is quantised, which makes it lighter (77 MB) and faster at inference. It achieves results comparable to BioBERT with a 10x speed improvement on 2-core CPU machines with 1 GB RAM.
The following is a step-by-step guide to setting up the required environment and running the QEB8L model.
To use the QEB8L model for Biomedical Named Entity Recognition, follow the steps below to install Python3 and Pip3, create a virtual environment, and install the required packages:
- Install Python3:
  - Open the terminal.
  - Update package lists:
    sudo apt update
  - Install Python3:
    sudo apt install python3
- Install Pip3:
  - Run:
    sudo apt install python3-pip
- Install Virtualenv:
  - Execute:
    pip3 install virtualenv
- Create a virtual environment:
  - Navigate to the desired directory in the terminal.
  - Run:
    virtualenv myenv
- Activate the virtual environment:
  - In the terminal, navigate to the virtual environment's directory.
  - Execute:
    source myenv/bin/activate
- Install the required Python packages:
  - In the activated virtual environment, run the following commands to install the required packages:
    pip install optimum==1.8.8
    pip install onnx==1.13.1
    pip install onnxruntime==1.15.1
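Because the packages above are pinned to exact versions, it can help to confirm that the activated environment actually contains them. The following is a minimal sketch using only the Python standard library; the helper name `find_mismatches` and the `PINNED` dictionary are illustrative and not part of the model repository.

```python
from importlib import metadata

# Versions pinned in the installation step above.
PINNED = {"optimum": "1.8.8", "onnx": "1.13.1", "onnxruntime": "1.15.1"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for missing or differing packages.

    `get_version` is injectable so the check can be exercised without
    installing anything; by default it queries the current environment.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = find_mismatches(PINNED)
    for name, (expected, found) in problems.items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

Running the script inside the activated `myenv` environment prints any package whose installed version differs from the pinned one.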
Once the virtual environment is activated, follow these steps to load and utilise the QEB8L model:
Download or clone the model from the repository, then store the path to the downloaded model folder in a Python variable:
quantised_path = "folder_where_the_model_is_located"  # replace with the actual path to the model folder
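A wrong path is an easy mistake at this point, so a quick check before loading can give a clearer error message. The sketch below assumes the folder contains the file `model_quantized.onnx`, the file name used when loading the model in this guide; the helper `check_model_folder` is hypothetical and not part of the repository.

```python
from pathlib import Path


def check_model_folder(folder, onnx_name="model_quantized.onnx"):
    """Return the folder as a Path if it exists and contains the ONNX file.

    Raises FileNotFoundError with a descriptive message otherwise.
    """
    path = Path(folder).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Model folder not found: {path}")
    if not (path / onnx_name).is_file():
        raise FileNotFoundError(f"{onnx_name} not found in {path}")
    return path


# Example use (uncomment once quantised_path points at the real folder):
# quantised_path = check_model_folder(quantised_path)
```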
- Import the required libraries:
from optimum.pipelines import pipeline
from functools import partial
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForTokenClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig
- Load the quantized model and tokenizer:
model_quantized = ORTModelForTokenClassification.from_pretrained(quantised_path, file_name="model_quantized.onnx")
tokenizer_quantized = AutoTokenizer.from_pretrained(quantised_path, model_max_length=512, batch_size=8, truncation=True)
- Create a pipeline for token classification:
ner_quantized = pipeline("token-classification", model=model_quantized, tokenizer=tokenizer_quantized, aggregation_strategy="first")
- Provide a sample text for Named Entity Recognition:
text = '''CLASS Omicron is a variant of SARS-CoV-2 first reported to the World Health Organization by the Network for Genomics Surveillance in South Africa on 24 November 2021. It was first detected in Botswana and has spread to become the predominant variant in circulation around the world.. The SARS-CoV-2 uses ACE2 to infect target cells and the expression of the ACE2 levels are increased following treatment with angiotensin-converting enzyme inhibitors (ACEIs); in addition, angiotensin receptor blockers (ARBs) has emerged speculation that patients with COVID-19 receiving these drugs may be under at a potentially increased risk for developing severe and fatal illness [11, 12]. Interestingly, when looking at ACE2 expression in different pathological stages (1 to 4), no differences was observed in any of the two lung cancer types (Figure 1C-D), suggesting stage might not the factor affecting ACE2 expression in lung tumor and therefore no significant differences in the susceptibility to SARS-CoV-2 infection among the pathological stages for LUAD and LUSC patients. Chronic kidney disease (CKD) is a global public health problem, and its prevalence is gradually increasing, mainly due to an increase in the number of patients with type 2 diabetes mellitus (T2DM) [1,2,3,4]. Human multidrug and toxin extrusion member 2 (MATE2-K, SLC47A2) plays an important role in the renal elimination of various clinical drugs including the antidiabetic drug metformin. The goal of this study was to characterize genetic variants of MATE2-K and determine their association with the pharmacokinetics of metformin'''
- Perform Named Entity Recognition:
pred = ner_quantized(text)
- Visualize the extracted entities:
for ent in pred:
    print([ent['start'], ent['end'], text[ent['start']:ent['end']], ent['entity_group'], ent['score']])
Each entity is printed in the following format: [start_span, end_span, entity, entity_type, score].
The entity types are as follows: 'GP': Gene/Protein, 'CD': Chemical/Drug, 'OG': Organism, and 'DS': Disease. This format allows you to identify the start and end positions of the entity in the text, the entity itself, its corresponding entity type, and the associated score.
[30, 40, 'SARS-CoV-2', 'OG', 0.98088056]
[288, 298, 'SARS-CoV-2', 'OG', 0.9897682]
[304, 308, 'ACE2', 'GP', 0.9994]
[358, 362, 'ACE2', 'GP', 0.9993819]
[409, 438, 'angiotensin-converting enzyme', 'GP', 0.9984209]
[472, 492, 'angiotensin receptor', 'GP', 0.99874496]
[552, 560, 'COVID-19', 'DS', 0.99102575]
[709, 713, 'ACE2', 'GP', 0.9994085]
[814, 825, 'lung cancer', 'DS', 0.9988202]
[895, 899, 'ACE2', 'GP', 0.9993438]
[914, 924, 'lung tumor', 'DS', 0.9986173]
[991, 1011, 'SARS-CoV-2 infection', 'DS', 0.9965901]
[1046, 1050, 'LUAD', 'DS', 0.9971084]
[1055, 1059, 'LUSC', 'DS', 0.99671656]
[1070, 1092, 'Chronic kidney disease', 'DS', 0.998764]
[1094, 1097, 'CKD', 'DS', 0.9988304]
[1235, 1259, 'type 2 diabetes mellitus', 'DS', 0.99851716]
[1261, 1265, 'T2DM', 'DS', 0.9988927]
[1279, 1284, 'Human', 'OG', 0.99373156]
[1285, 1323, 'multidrug and toxin extrusion member 2', 'GP', 0.99628574]
[1325, 1332, 'MATE2-K', 'GP', 0.99739236]
[1334, 1341, 'SLC47A2', 'GP', 0.9994468]
[1450, 1459, 'metformin', 'CD', 0.99891806]
[1524, 1531, 'MATE2-K', 'GP', 0.99751383]
[1593, 1602, 'metformin', 'CD', 0.9987571]
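For downstream use it is often more convenient to collect the unique mentions per entity type than to read the flat list above. The sketch below assumes each prediction is a dictionary with the keys `start`, `end`, `entity_group`, and `score`, as used in the printing loop; the helper `group_entities`, its `min_score` parameter, and the sample data are illustrative only.

```python
from collections import defaultdict


def group_entities(text, predictions, min_score=0.0):
    """Group unique entity mentions by entity type, in order of appearance.

    Predictions scoring below `min_score` are skipped.
    """
    grouped = defaultdict(list)
    for ent in predictions:
        if ent["score"] < min_score:
            continue
        mention = text[ent["start"]:ent["end"]]
        mentions = grouped[ent["entity_group"]]
        if mention not in mentions:
            mentions.append(mention)
    return dict(grouped)


# Small hand-made example in the same shape as the pipeline output.
sample_text = "ACE2 expression and metformin; ACE2 again"
sample_pred = [
    {"start": 0, "end": 4, "entity_group": "GP", "score": 0.99},
    {"start": 20, "end": 29, "entity_group": "CD", "score": 0.99},
    {"start": 31, "end": 35, "entity_group": "GP", "score": 0.98},
]
print(group_entities(sample_text, sample_pred))
```

With the real pipeline output, the call would be `group_entities(text, pred)`, optionally with a score threshold such as `min_score=0.9`.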
Metric | BioBERT (CD) | QEB8L (CD) | BioBERT (DS) | QEB8L (DS) | BioBERT (OG) | QEB8L (OG) | BioBERT (GP) | QEB8L (GP) |
---|---|---|---|---|---|---|---|---|
Precision | 0.91 | 0.85 | 0.90 | 0.90 | 0.93 | 0.94 | 0.91 | 0.90 |
Recall | 0.92 | 0.90 | 0.80 | 0.88 | 0.86 | 0.85 | 0.87 | 0.88 |
F1 Score | 0.92 | 0.88 | 0.85 | 0.89 | 0.90 | 0.89 | 0.89 | 0.89 |
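One rough way to compare the two models across all four entity types is to average the per-type F1 scores from the table. The sketch below computes an unweighted (macro) mean; this summary figure is not reported by the authors, it ignores how many entities of each type the evaluation set contains, and it inherits the rounding of the table values.

```python
# Per-type F1 scores copied from the table above: (BioBERT, QEB8L).
F1_SCORES = {
    "CD": (0.92, 0.88),
    "DS": (0.85, 0.89),
    "OG": (0.90, 0.89),
    "GP": (0.89, 0.89),
}


def macro_average(scores, column):
    """Unweighted mean of one column (0 = BioBERT, 1 = QEB8L) over entity types."""
    values = [pair[column] for pair in scores.values()]
    return sum(values) / len(values)


biobert_macro_f1 = macro_average(F1_SCORES, 0)
qeb8l_macro_f1 = macro_average(F1_SCORES, 1)
print(f"BioBERT macro F1: {biobert_macro_f1:.4f}")
print(f"QEB8L macro F1:   {qeb8l_macro_f1:.4f}")
```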
We present the Quantised EPMCA Bioformer-8L (QEB8L) model for Biomedical Named Entity Recognition. By using the ONNX runtime and quantisation, we obtained a faster and lighter model. The results show performance comparable to BioBERT across the four entity types, with a significant speed improvement.
- APA
Tirunagari, S., & Harrison, M. (2023). Accelerating Biomedical Named Entity Recognition with Quantised EPMCA Bioformer-8L (QEB8L) Model (Version 0.0.1) [Computer software]. Retrieved from https://github.com/ML4LitS/annotation_models
- BibTeX
@software{tirunagari2023accelerating, author = {Tirunagari, Santosh and Harrison, Melissa}, doi = {}, month = {06}, title = {Accelerating Biomedical Named Entity Recognition with Quantised EPMCA Bioformer-8L (QEB8L) Model}, url = {https://github.com/ML4LitS/annotation_models}, version = {0.0.1}, year = {2023} }
CC-by