/Abstractive-Summarization-and-Question-Answering-of-Medical-Texts-using-T5

Fine-tuning T5 for abstractive summarization and question answering of medical texts, simplifying complex medical information into patient-friendly language.

Primary language: Jupyter Notebook. License: MIT.

MedEaseIne: Simplifying Medical Info

Overview 🧐

MedEaseIne ⭐ is a project that turns medical texts and documents into patient-friendly summaries and answers questions based on a given medical context. We use the T5 (Text-To-Text Transfer Transformer) model from the Hugging Face Transformers library for abstractive summarization and question answering on medical texts. The T5 model is fine-tuned on medical domain-specific data (the PubMed Subset Bulk and SumPubMed) to generate concise summaries. We also use the Google Gemini API to extract information from texts or documents, summarize it in patient-friendly language, and support question answering.
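As a rough illustration of the summarization path described above, the sketch below loads a T5 checkpoint with the Transformers library and generates a summary. The function names, the `summarize: ` task prefix, and the generation settings are assumptions for illustration; the fine-tuned checkpoint in this repository may expect a different input format. The default model path follows the directory layout shown in this README.

```python
def build_summarization_input(text: str, prefix: str = "summarize: ") -> str:
    """Prepend a T5 task prefix and collapse runs of whitespace.

    The prefix is an assumption: it is the one used by the public T5
    checkpoints, not necessarily by this project's fine-tuned model.
    """
    return prefix + " ".join(text.split())


def summarize(
    text: str,
    model_dir: str = "app/models/summarization/summarization_final_trained_model",
    max_new_tokens: int = 128,
) -> str:
    """Generate an abstractive summary with a local T5 checkpoint."""
    # Imported lazily so the helper above works without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    # Truncate long articles to the 512-token window commonly used with T5.
    inputs = tokenizer(
        build_summarization_input(text),
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=4,  # hypothetical setting, chosen only for illustration
        early_stopping=True,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize(article_text)` requires the downloaded model files to be present in `model_dir`; `build_summarization_input` can be used on its own to inspect what is fed to the tokenizer.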

Demo 📹

Demo Gif

Directory Structure 🗂️

.
├── app/
│   └── models/
│       ├── question_answering/
│       │   └── checkpoint-1500/
│       └── summarization/
│           └── summarization_final_trained_model
├── static/
│   ├── css/
│   │   └── styles.css
│   ├── images/
│   └── js/
│       └── script.js
├── templates/
│   ├── about_us.html
│   ├── contact_us.html
│   ├── home.html
│   ├── layout.html
│   ├── qna.html
│   ├── result.html
│   └── summarize.html
├── utils/
│   ├── question_answering.py
│   └── summarization.py
├── __init__.py
├── forms.py
├── routes.py
├── scripts/
│   ├── dataset_creation_pubmed_subset.ipynb
│   ├── fine_tune_question_answer.ipynb
│   ├── finetuning_T5_for_summarization.ipynb
│   ├── subset_sumpubmed.py
│   └── sumpubmed_dataset_script.py
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── run.py

Datasets and Fine-Tuning 🗄️

  • All the datasets used to fine-tune our models can be found here.
  • All the scripts used to train our models can be found here.
  • This script was used to extract the abstracts and the full content from PubMed articles. The CHV (Consumer Health Vocabulary) dataset was then used to replace complex terms with consumer vocabulary, simplifying the abstracts. The simplified abstracts served as the target text for our T5-small model, and the full content served as the source.
  • This script and this script were used to create a subset of the SumPubMed dataset.
  • This script was used to fine-tune our T5-small model for the summarization task in a Google Colab environment.
  • This script was used to fine-tune our T5-base model for the question-answering task. The SQuAD dataset was used for fine-tuning.
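The term-replacement step described above can be sketched as a dictionary lookup over the abstract. This is a minimal illustration, not the project's actual script: the three-entry `CHV_TERMS` mapping and the function names are hypothetical, and the real CHV dataset is far larger.

```python
import re
from typing import Dict, Mapping

# Hypothetical mini-vocabulary (lowercase keys); stands in for the CHV dataset.
CHV_TERMS = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
    "renal": "kidney",
}


def simplify_terms(text: str, vocabulary: Mapping[str, str]) -> str:
    """Replace whole-word medical terms with their consumer-friendly form."""
    if not vocabulary:
        return text
    # Longest terms first so multi-word terms win over shorter ones.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        flags=re.IGNORECASE,
    )
    # Matching is case-insensitive; the replacement is inserted as written
    # in the vocabulary, so original capitalization is not preserved.
    return pattern.sub(lambda m: vocabulary[m.group(0).lower()], text)


def make_training_pair(content: str, abstract: str) -> Dict[str, str]:
    """Pair the full article content (source) with the simplified abstract (target)."""
    return {"source": content, "target": simplify_terms(abstract, CHV_TERMS)}
```

Word boundaries keep the replacement from touching substrings, so `renal` is replaced but `adrenal` is left alone.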

Features ⚙️

  • Medical Documents and Texts Summarization 📝:
    • T5 for medical articles summarization.
    • Gemini for summary generation in patient-friendly language.
    • Gemini for summarization of medical documents like Discharge Summaries, Medical Histories, Diagnostic Reports, Clinical Notes, Treatment Plans, etc.
  • Question Answering ❓:
    • Answers medical questions based on a given medical context.
    • Lets the user ask any question related to the context, as well as common health queries.
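For context-based question answering with a text-to-text model, the question and context are usually packed into a single input string. The sketch below uses the `question: ... context: ...` layout that the public T5 checkpoints use for SQuAD; whether this project's fine-tuned checkpoint expects exactly that layout is an assumption, as are the function names and generation settings. The default model path follows the directory layout shown in this README.

```python
def build_qa_input(question: str, context: str) -> str:
    """Pack a question and its context into one T5 input string.

    The "question: ... context: ..." layout is an assumption borrowed from
    the public T5 SQuAD task format.
    """
    clean_context = " ".join(context.split())
    return f"question: {question.strip()} context: {clean_context}"


def answer_question(
    question: str,
    context: str,
    model_dir: str = "app/models/question_answering/checkpoint-1500",
    max_new_tokens: int = 64,
) -> str:
    """Generate an answer with a local T5 question-answering checkpoint."""
    # Imported lazily so the helper above works without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    inputs = tokenizer(
        build_qa_input(question, context),
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because the context is truncated to the model's input window, long documents may need to be split into passages before they are passed to `answer_question`.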

Tools and Technologies Used 🤖

  • Python
  • Flask
  • HTML/CSS/JavaScript
  • Pandas
  • NumPy
  • TensorFlow
  • PyTorch
  • Transformers
  • Google Gemini API
  • Google Colab

Installation 🛠️

1. Clone the Repository:

git clone https://github.com/Tangsang2003/Abstractive-Summarization-and-Question-Answering-of-Medical-Texts-using-T5.git

2. Create Virtual Environment:

  • For Windows:
python -m venv venv
  • For Linux and macOS:
python3 -m venv venv
  • Activate the virtual environment on Windows:
venv\Scripts\activate
  • Activate the virtual environment on Linux and macOS:
source venv/bin/activate

3. Install Dependencies

pip install -r "requirements.txt"

4. Download and configure ML models

  • Go to app and create the following directories:
.
├── app/
│   └── models/
│       ├── question_answering/
│       │   └── checkpoint-1500/
│       └── summarization/
│           └── summarization_final_trained_model
$ cd app
$ mkdir models
$ cd models
$ mkdir question_answering
$ cd question_answering
$ mkdir checkpoint-1500
$ cd ..
$ mkdir summarization
$ cd summarization
$ mkdir summarization_final_trained_model
  • Download T5 model for Summarization from here.
  • Copy all the files to the summarization_final_trained_model directory.
  • Download T5 model for Question Answering from here.
  • Copy all the files to the checkpoint-1500 directory.
  • Obtain GOOGLE_API_KEY from here.
  • Set SECRET_KEY and GOOGLE_API_KEY as environment variables on your system.
  • SECRET_KEY can be any value you choose.
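The directory and environment-variable steps above can also be done from Python. The following is a hedged convenience sketch, not part of the repository: the function names are hypothetical, while the directory paths and variable names are the ones listed in this README.

```python
import os
from pathlib import Path

# Model directories named in step 4, relative to the repository root.
MODEL_DIRS = (
    Path("app/models/question_answering/checkpoint-1500"),
    Path("app/models/summarization/summarization_final_trained_model"),
)
# Environment variables the application is expected to read.
REQUIRED_ENV_VARS = ("SECRET_KEY", "GOOGLE_API_KEY")


def create_model_dirs(root: Path) -> list:
    """Create the model directories under root; existing ones are left untouched."""
    created = []
    for relative in MODEL_DIRS:
        target = root / relative
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


def missing_env_vars(environ=os.environ) -> list:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    for directory in create_model_dirs(Path(".")):
        print(f"ready: {directory}")
    for name in missing_env_vars():
        print(f"missing environment variable: {name}")
```

Run it from the repository root before copying the downloaded model files into place; it reports any variable that still needs to be set.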

5. Run Application

python run.py

Contributing 🤝

If you'd like to contribute to this project, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature: git checkout -b feature-name.
  3. Commit your changes: git commit -m 'Add some feature'.
  4. Push to the branch: git push origin feature-name.
  5. Submit a pull request.

Future Work 🔜

  • Development of a user feedback mechanism to improve our T5 model for summarization and question-answering.
  • Creation of further-refined datasets.
  • Deployment of the web application on AWS, Azure or Google Cloud.
  • Explore partnerships with healthcare institutions, research organizations, or educational platforms to integrate MedEaseIne into clinical workflows, medical education, or research activities.

Thank You 🙏