Project repository for the Health Informatics course (CSE 6250) at Georgia Tech.
New medical studies provide a rich source of material that doctors can use to improve patient outcomes in novel ways. In this study, we explore whether patient mortality predictions can be improved using text features generated by NLP transformers and, if so, whether the improvements in prediction scores are attributable to the specific transformer used. We used two transformers: a generic transformer trained on PubMed text (BlueBERT) and a use-case-specific transformer trained on coronavirus text (CORD-19). For comparison, we also trained two other patient mortality models: (1) trained on structured data only; (2) trained on structured data plus text features generated using TF-IDF. Results show that the model trained on structured data and TF-IDF text features outperforms the BlueBERT-trained model, and that there is no significant difference in performance between the BlueBERT- and CORD-19-trained models.
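As an illustration of the TF-IDF baseline (model 2 above), the sketch below concatenates structured features with TF-IDF text features and fits a random forest. All data, column contents, and parameter values here are toy placeholders, not the project's actual pipeline or MIMIC-III data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the real inputs (illustrative only):
# structured features (e.g., vitals/labs) and one clinical note per patient.
structured = np.array([[72.0, 1], [88.0, 0], [95.0, 1], [60.0, 0]])
notes = [
    "patient stable, discharged home",
    "acute respiratory failure, intubated",
    "sepsis, vasopressors started",
    "routine follow up, no complaints",
]
mortality = np.array([0, 1, 1, 0])  # toy mortality labels

# Turn free text into TF-IDF features and concatenate with structured data.
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(notes).toarray()
X = np.hstack([structured, text_features])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, mortality)
probs = clf.predict_proba(X)[:, 1]  # per-patient mortality risk scores
```

The same feature-concatenation pattern applies when the TF-IDF vectors are replaced by transformer-derived text features.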
All project-related files are contained in the `src` directory:

1. `src\ML_Results.ipynb`: Jupyter notebook containing the results of the study.
2. `src\preprocessing.py`: PySpark script containing the preprocessing logic; meant to run on a Spark cluster.
3. `src\preprocessing2.py`: Python script for feature preprocessing (downstream of #2).
4. `src\ml.py`: Python script containing the ML model training logic, derived from the analysis in #1 (downstream of #3).
5. `src\preprocessing-py.py`: (not a main component of this project) Python script used for preprocessing a data sample small enough to load on a single machine.
The outputs of `preprocessing.py` and `preprocessing2.py` will appear under `data/processed/spark-etl/` and `data/processed/spark-processed-features/`, respectively.
This project makes use of the following:
- MIMIC-III dataset (access permissions required). Once you have access, place the data under `data/raw/` from the repository root; the project files that require the data as input (`preprocessing.py` and `preprocessing2.py`) can then be executed.
- Spark 3.2 (PySpark, along with the Pandas API)
- scikit-learn's RandomForestClassifier
- Hugging Face pretrained models
- DVC: for data and model version control
- GCP storage
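The transformer text features mentioned above are typically produced by pooling a model's token-level hidden states into one fixed-size vector per note. The sketch below shows masked mean pooling using a random tensor in place of real BlueBERT/CORD-19 outputs, since those require model downloads and MIMIC-III access; the shapes and mask layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a transformer's last hidden state: (batch, seq_len, hidden_dim).
# In the real pipeline this would come from BlueBERT or a CORD-19-trained model.
hidden_states = rng.normal(size=(2, 16, 768))

# Attention mask: 1 for real tokens, 0 for padding (the second note is shorter).
mask = np.ones((2, 16))
mask[1, 10:] = 0

# Mean-pool over non-padding tokens to get one feature vector per note.
masked = hidden_states * mask[:, :, None]
note_vectors = masked.sum(axis=1) / mask.sum(axis=1, keepdims=True)
# note_vectors can now be concatenated with structured features for the classifier.
```

Mean pooling is one common choice; using the `[CLS]` token's hidden state is another, and the source does not specify which was used here.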
An exhaustive list of project requirements can be found in:
After setting up the environment, you can run files in the following order (dependency graph):
```
preprocessing.py -> preprocessing2.py -> ml.py
```

```shell
python preprocessing.py   # can also run on a cluster for parallelism
python preprocessing2.py  # should be run on a local machine
python -i ml.py           # for experimenting with the ML models
```