Identify the disease category basis the clinical/medical abstracts.
train.dat and test.dat files are present in data folder.
Columns
-
disease category --> 1,2,3,4,5
-
abstract --> text
Classes / Categories
disease_categories = {
1: "digestive system diseases",
2: "cardiovascular diseases",
3: "neoplasms",
4: "nervous system diseases",
5: "general pathological conditions"
}
-
Leveraging text embeddings models to convert abstract text into vector and then training a classical ML model on the vectorized text. Used OpenAI's "text-embedding-ada-002" text embeddings model.
-
Finetunning an Encoder only model for multi class classification. In this case we leverage Bio_ClinicalBERT model already trained on clinical texts. This offers the advantage of domain specific knowledge already embedded in the model. Trained this model for multi class classification task.
-
Finetunning a Large Language Model for classification task In this case we finetuned a Mistral-7B-Instruct model for medical abstract classification task.
Solution approach and details can be seen in notebooks/model_with_embeddings.ipynb
The embeddings lack the clinical domain understaing and hence model does not do so well.
Finetunning a pre-trained Bio_ClinicalBERT [emilyalsentzer/Bio_ClinicalBERT] model for multi class classification.
The base model offers the advantage of domain specific knowledge already embedded in the model.
The app is containerazed. Simply build the provided Dockerfile and run
docker build -t my-bert-app .
docker run -p 8080:8080 my-bert-app
OR
create virtual environment with python >= 3.10 and install required packages and run main.py
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
python main.py
Finally hit the api endpoint with following curl command / Python Client
curl -X 'POST' \
'http://localhost:8080/classify' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"abstract": "This is my clinical abstract"
}'
Python Client
import requests
url = 'http://localhost:8080/classify'
headers = {
'accept': 'application/json',
'Content-Type': 'application/json'
}
data = {
'abstract': 'This is my clinical abstract'
}
response = requests.post(url, headers=headers, json=data)
print(response.status_code)
print(response.json())
Response Json
{
'res_id':'id',
'category' : 'disease category',
'category_index':1,
'confidence':0.7
}
mistralai/Mistral-7B-v0.1 [https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1] model is finetuned to do classificatin task.
Steps in solution
- Instruct Dataset Preparation
- Finetunning the model with LoRA PEFT technique
- Merge LoRA adapters with original model and do inference
The app is containerazed. Simply build the provided Dockerfile and run
docker build -t my-mistral-app .
docker run --gpus all -p 8081:8081 my-mistral-app
OR
create virtual environment with python >= 3.10 and install required packages and run main_mistral.py
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
python main_mistral.py
Finally hit the api endpoint with following curl command / Python Client
curl -X 'POST' \
'http://localhost:8081/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"abstract": "This is my clinical abstract"
}'
Python Client
import requests
url = 'http://localhost:8081/generate'
headers = {
'accept': 'application/json',
'Content-Type': 'application/json'
}
data = {
'abstract': 'This is my clinical abstract'
}
response = requests.post(url, headers=headers, json=data)
print(response.status_code)
print(response.json())
Response Json
{
'res_id':'id',
'category' : 'disease category',
'category_index':1,
'confidence':0.7
}