Project repository for the Health Informatics course (CSE 6250) at Georgia Tech.
New medical studies provide a rich source of material that doctors can use to improve patient outcomes in novel ways. In this study, we explore whether patient mortality predictions can be improved using text features generated by NLP transformers and, if so, whether the improvements in prediction scores are attributable to the specific transformer used. We used two transformers: a generic transformer trained on PubMed text (BlueBERT) and a use-case-specific transformer trained on coronavirus text (CORD-19). For comparison, we also trained two other patient mortality models: (1) trained on structured data only; (2) trained on structured data plus text features generated using TF-IDF. Results show that the model trained on structured data and TF-IDF text features outperforms the BlueBERT-trained model, and that there is no significant difference in performance between the BlueBERT- and CORD-19-trained models.
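As an illustration of the TF-IDF baseline (model 2 above), the sketch below concatenates structured features with TF-IDF text features and fits a random forest. All data, column contents, and parameter values here are toy placeholders, not the project's actual pipeline or MIMIC-III data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the real inputs (illustrative only):
# structured features (e.g., vitals/labs) and one clinical note per patient.
structured = np.array([[72.0, 1], [88.0, 0], [95.0, 1], [60.0, 0]])
notes = [
    "patient stable, discharged home",
    "acute respiratory failure, intubated",
    "sepsis, vasopressors started",
    "routine follow up, no complaints",
]
mortality = np.array([0, 1, 1, 0])  # toy mortality labels

# Turn free text into TF-IDF features and concatenate with structured data.
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(notes).toarray()
X = np.hstack([structured, text_features])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, mortality)
probs = clf.predict_proba(X)[:, 1]  # per-patient mortality risk scores
```

The same feature-concatenation pattern applies when the TF-IDF vectors are replaced by transformer-derived text features.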
All project-related files are contained in the `src` directory:

1. `src\ML_Results.ipynb`: Jupyter notebook containing the results of the study.
2. `src\preprocessing.py`: PySpark script containing the preprocessing logic; meant to run on a Spark cluster.
3. `src\preprocessing2.py`: Python script for feature preprocessing (downstream of #2).
4. `src\ml.py`: Python script containing the ML model training logic, derived from the analysis in #1 (downstream of #3).
5. `src\preprocessing-py.py`: (not a main component of this project) Python script used for preprocessing a data sample small enough to load on a single machine.
The outputs of `preprocessing.py` and `preprocessing2.py` will appear under `data/processed/spark-etl/` and `data/processed/spark-processed-features/`, respectively.
This project makes use of the following:
- MIMIC-III dataset (access permissions required). Once you have access, place the data under `data/raw/` from the repository root; the project files that require the data as input (`preprocessing.py` and `preprocessing2.py`) can then be executed.
- Spark 3.2 (PySpark, along with the Pandas API)
- scikit-learn's RandomForestClassifier
- Hugging Face pretrained models
- DVC: for data and model version control
- GCP storage
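The transformer text features mentioned above are typically produced by pooling a model's token-level hidden states into one fixed-size vector per note. The sketch below shows masked mean pooling using a random tensor in place of real BlueBERT/CORD-19 outputs, since those require model downloads and MIMIC-III access; the shapes and mask layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a transformer's last hidden state: (batch, seq_len, hidden_dim).
# In the real pipeline this would come from BlueBERT or a CORD-19-trained model.
hidden_states = rng.normal(size=(2, 16, 768))

# Attention mask: 1 for real tokens, 0 for padding (the second note is shorter).
mask = np.ones((2, 16))
mask[1, 10:] = 0

# Mean-pool over non-padding tokens to get one feature vector per note.
masked = hidden_states * mask[:, :, None]
note_vectors = masked.sum(axis=1) / mask.sum(axis=1, keepdims=True)
# note_vectors can now be concatenated with structured features for the classifier.
```

Mean pooling is one common choice; using the `[CLS]` token's hidden state is another, and the source does not specify which was used here.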
An exhaustive list of project requirements can be found in:
After setting up the environment, you can run files in the following order (dependency graph):
```
preprocessing.py -> preprocessing2.py -> ml.py
```

```shell
python preprocessing.py   # can also run on a cluster for parallelism
python preprocessing2.py  # should be run on a local machine
python -i ml.py           # for experimenting with the ML models
```