Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification From Clinical Notes
This repository contains the code for the research paper *Ensembling Classical Machine Learning and Deep Learning Approaches for Morbidity Identification From Clinical Notes*, developed as the final project for CS 598 Deep Learning in Healthcare at the University of Illinois Urbana-Champaign. The paper explores ensembling techniques that combine classical machine learning and deep learning approaches for more accurate morbidity identification from clinical notes.
Getting Started
In this paper we have used the n2c2 NLP dataset, for which access must be requested through the Harvard Medical School portal. This dataset was originally generated during the i2b2 project.
Prerequisites
Clone the repository:

```
git clone https://github.com/shhr3y/DLH-Proj
```
We highly recommend using a virtual environment for this project, as some dependencies may cause version conflicts. We are using Miniforge to create our environment.
- run `brew install miniforge` to install Miniforge on your system.
- run `conda create --name dlh python=3.10` (we recommend Python 3.10 for this project).
- run `conda activate dlh`
To install all the dependencies for this project:
- run `pip install -r requirements.txt`
Dataset Generation
Before proceeding, note that the data from the Harvard Medical School portal is in XML format and must be converted to CSV for training. To convert the .xml data to .csv format, run `xml2csv.py`, which generates a CSV file with columns such as [id, text, Asthma, CAD, CHF, ...].
- run `python xml2csv.py`
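For reference, a minimal sketch of what such a conversion can look like is shown below; the tag names (`doc`, `text`) and file names are placeholders rather than the exact n2c2/i2b2 schema, merging of the morbidity annotation columns is omitted, and `xml2csv.py` in this repository is the authoritative implementation.

```python
# Minimal sketch of an XML-to-CSV conversion; tag names are placeholders,
# not the exact n2c2/i2b2 schema, and the label columns are not merged here.
import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_csv(xml_path: str, csv_path: str) -> None:
    root = ET.parse(xml_path).getroot()
    rows = [{"id": doc.get("id"), "text": doc.findtext("text", default="")}
            for doc in root.iter("doc")]          # one clinical note per <doc>
    pd.DataFrame(rows).to_csv(csv_path, index=False)

xml_to_csv("obesity_records.xml", "records.csv")  # hypothetical file names
```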
After generating the CSV file, we apply our pre-processing functions (tokenization, lowercasing, removal of non-alphabetic characters, lemmatization, one-hot encoding). For this purpose we use the PreProcess class in pre_processing.py, which takes input_file_destination (the generated CSV file path) and output_file_destination (the output CSV file path) and can be imported into other scripts.
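As an illustration only, here is a minimal sketch of those text-cleaning steps; the use of NLTK's WordNet lemmatizer is an assumption, one-hot encoding of the labels is omitted, and the PreProcess class in pre_processing.py is the implementation actually used.

```python
# Sketch of the pre-processing steps listed above: tokenize, lowercase,
# drop non-alphabetic characters, lemmatize. Not the repo's exact code.
import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lookup data for the lemmatizer

lemmatizer = WordNetLemmatizer()

def clean_note(text: str) -> str:
    # Tokenize + lowercase + remove non-alphabetic characters in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Lemmatize each remaining token.
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

print(clean_note("Patient admitted with CHF exacerbation in 2019."))
# -> patient admitted with chf exacerbation in
```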
We have used Weka for some machine learning models such as JRip and J48. Weka requires its input as an .arff file, so these files need to be generated; we used a pandas2arff script to produce the files required by the Weka library. This was used to create datasets for both the TF-IDF and word-embedding representations. Multiple word-embedding techniques are used (code in feature_generation.py), such as Word2Vec, GloVe, FastText, and Universal Sentence Encoder; a sketch of the word-embedding feature step follows the commands below.
- run `python screate_arff_tfidf.py` to generate the TF-IDF dataset.
- run `python create_arff_word_embeddings.py` to generate the word-embeddings dataset.
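For illustration, below is a hedged sketch of the word-embedding feature step mentioned above, averaging Word2Vec vectors per note with gensim; feature_generation.py is the actual implementation (and also covers GloVe, FastText, and Universal Sentence Encoder), and the ARFF export is not shown.

```python
# Hedged sketch of one word-embedding representation: averaging Word2Vec
# vectors over each note. Not the exact code in feature_generation.py.
import numpy as np
from gensim.models import Word2Vec

notes = [
    "patient admitted with chf exacerbation",
    "history of asthma and gerd",
]
tokenized = [note.split() for note in notes]

# Train a small Word2Vec model on the toy corpus (in practice, pretrained
# vectors or the full note corpus would be used).
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)

def note_vector(tokens, model):
    """Mean of the word vectors for tokens present in the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = np.vstack([note_vector(t, w2v) for t in tokenized])
print(features.shape)  # (2, 100)
```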
These instructions will get all the dataset files required for the project ready on your local machine for development and testing purposes.
Classical Machine Learning Models
We have implemented several machine learning models: DecisionTree, J48, JRip, KNN, NaiveBayes, RandomForest, and SVM. The code for these models can be found under the /ML/ directory. Further, we use two types of textual representations, TF-IDF and word embeddings, which can be found at ML/tf-idf and ML/word-embeddings respectively.
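As a rough sketch of how one of these classical models can be trained on the TF-IDF representation, the example below combines TF-IDF features, ExtraTrees-based feature selection, SMOTE oversampling, and a RandomForest (the setup reported in the Results section). The file name `records.csv`, the 5000-feature cap, and the other hyperparameters are assumptions, not the exact settings of the scripts in /ML/.

```python
# Hedged sketch: RandomForest on TF-IDF features with ExtraTrees-based feature
# selection and SMOTE oversampling, evaluated with macro/micro F1.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv("records.csv")                 # output of the steps above (assumed name)
X_text, y = df["text"], df["Asthma"]            # one binary morbidity label at a time

X = TfidfVectorizer(max_features=5000).fit_transform(X_text)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# ExtraTrees-based feature selection.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=42))
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# SMOTE oversampling on the training split only.
X_tr_res, y_tr_res = SMOTE(random_state=42).fit_resample(X_tr_sel, y_tr)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr_res, y_tr_res)
pred = clf.predict(X_te_sel)
print("Macro F1:", f1_score(y_te, pred, average="macro"))
print("Micro F1:", f1_score(y_te, pred, average="micro"))
```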
Deep Learning Model
We have implemented a stacked Bidirectional Long Short-Term Memory (BiLSTM) RNN with n = 2 layers of 128 and 64 hidden units. We use binary cross-entropy as the loss function and several word embeddings: GloVe, FastText, Universal Sentence Encoder, and Word2Vec. The code for this deep learning model can be found at /DL/.
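A minimal Keras sketch of such a stacked BiLSTM is shown below (two bidirectional layers with 128 and 64 units, a sigmoid output, and binary cross-entropy loss); the vocabulary size, sequence length, and embedding dimension are placeholders, and the code in /DL/ is the actual model.

```python
# Hedged sketch of the stacked BiLSTM described above (2 BiLSTM layers with
# 128 and 64 units, binary cross-entropy loss); sizes are placeholders.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # placeholder
MAX_LEN = 500        # placeholder
EMBED_DIM = 300      # e.g. the Word2Vec/GloVe/FastText vector dimension

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),  # first BiLSTM layer
    layers.Bidirectional(layers.LSTM(64)),                          # second BiLSTM layer
    layers.Dense(1, activation="sigmoid"),                          # binary morbidity label
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```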
These instructions will get all the results of the different models ready on your local machine for development and testing purposes.
Results
CML (RF) Performance with SMOTE and ExtraTrees
We achieved better results on TF-IDF with RandomForest using SMOTE and an ExtraTrees classifier for feature selection, compared to RandomForest with all features (no feature selection or oversampling), which shows that feature selection and oversampling improve performance.
Morbidity Class | RF Macro F1 (w/o SMOTE & ExtraTrees) | RF Micro F1 (w/o SMOTE & ExtraTrees) | RF Macro F1 (SMOTE & ExtraTrees) | RF Micro F1 (SMOTE & ExtraTrees) |
---|---|---|---|---|
Asthma | 0.49245763267483306 | 0.8810949788263762 | 0.9888628096539887 | 0.989019801980198 |
CAD | 0.9178062504459759 | 0.9233333333333335 | 0.9266441020391909 | 0.9276923076923078 |
CHF | 1.0 | 1.0 | 1 | 1 |
Depression | 0.5138524072645619 | 0.8058153126826418 | 0.9339450018380673 | 0.9347826086956521 |
Diabetes | 0.8739219960299753 | 0.901441102756892 | 0.9771615914904839 | 0.9772943037974684 |
Gallstones | 0.4599902140879871 | 0.8532485875706215 | 0.9294916354225806 | 0.9308289652494661 |
GERD | 0.4323776782842145 | 0.7638605442176871 | 0.877896163015734 | 0.879045045045045 |
Gout | 0.4647957966912661 | 0.8691242937853108 | 0.9548711662946365 | 0.955563853622106 |
Hypercholesterolemia | 0.8400652419091632 | 0.8425490196078431 | 0.8845109536038042 | 0.8872641509433963 |
Hypertension | 0.50500773716472 | 0.81743535988819 | 0.9610438529632302 | 0.9615184678522573 |
Hypertriglyceridemia | 0.4854267614497737 | 0.9438924605493864 | 0.9836073013237844 | 0.9837510237510237 |
OA | 0.46509042983465887 | 0.8282581453634086 | 0.9257431596498952 | 0.9271219400594829 |
Obesity | 0.9076256375579452 | 0.913116883116883 | 0.9771596670227283 | 0.9777009728622632 |
OSA | 0.5519158380668717 | 0.8711864406779661 | 0.9736740853175949 | 0.974296253154727 |
PVD | 0.5783369875048958 | 0.8632792207792208 | 0.975896847083613 | 0.9765614275909403 |
Venous_Insufficiency | 0.47801077753999144 | 0.9164368650217707 | 0.9736143705713205 | 0.9741194158075602 |
Average | 0.6229175866566771 | 0.8746295342610958 | 0.952757669205666 | 0.9535350336314934 |
DL Performance with different word embeddings
After testing our deep learning model with different word embeddings, we found that USE (Universal Sentence Encoder) performs best.
Embeddings | Avg Macro F1 (%) | Avg Micro F1 (%) |
---|---|---|
Word2Vec | 65.98 | 81.99 |
GloVe | 63.87 | 78.95 |
FastText | 66.65 | 81.94 |
USE | 74.00 | 85.69 |
Authors
kshitij6798
shhr3y