Project_4
Group Members:
Brian Lee
Yu-Hsi (Joy) Chen
Ilkay Ates
Roy Jiang
Alexis Valdez
Project Description:
This project uses healthcare data from stroke patients as a source of extensive clinical data. The data will be analyzed and used to build a machine learning model that identifies which features (gender, age, hypertension, etc.) most strongly determine stroke risk. The analysis will include the correlation between each feature and the target outcome (stroke or no stroke). The project aims to provide valuable insights for doctors, clinical studies, and public health on the effects and prevention of strokes. Pandas, Matplotlib, and SQL will be used to analyze and visualize the stroke predictions.
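As a rough illustration of the feature-to-target correlation mentioned above, the sketch below computes a Pearson correlation between one numeric feature and the binary stroke label using only the standard library. The `pearson` helper and the sample values are assumptions made for illustration; the numbers are invented and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Invented illustrative values, NOT rows from the Kaggle dataset.
age = [25, 40, 52, 61, 67, 79]
stroke = [0, 0, 0, 1, 0, 1]  # 1 = had a stroke, 0 = did not

r_age = pearson(age, stroke)
print(f"correlation(age, stroke) = {r_age:.3f}")
```

With a real table, the same calculation would be repeated for each numeric column. Categorical columns such as `work_type` would need to be encoded as numbers first.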
Questions to ask:
- Which socioeconomic groups have a higher chance of stroke?
- Does the data show that marital status affects stroke risk?
- Do smokers have a higher risk of having a stroke?
- Does diet affect the likelihood of having a stroke?
- At what age is a stroke most likely?
- Which features in our dataset affect the likelihood of a stroke?
Parameters to consider:
gender
age
hypertension
heart_disease
ever_married
work_type
Residence_type
avg_glucose_level
bmi
smoking_status
Resources:
https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
Breakdown of Tasks:
- Research potential data sources
- Machine learning:
  - Clean, normalize, and standardize the data prior to modeling (from SQL or Spark)
  - Build the machine learning model
- Data visualization:
  - Matplotlib
  - Tableau
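The normalization and standardization step in the task list can be sketched as follows. This is a minimal standard-library illustration, not the project's actual pipeline: the helper names are hypothetical, and the glucose values are invented rather than read from the dataset.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: no spread to scale by
    return [(v - mean) / std for v in values]


# Invented avg_glucose_level values, NOT rows from the Kaggle dataset.
glucose = [70.0, 85.5, 105.9, 171.2, 228.7]

normalized = min_max_normalize(glucose)
standardized = standardize(glucose)
print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```

In a train/test workflow, the minimum, maximum, mean, and standard deviation would normally be computed on the training split only and then reused on the test split.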
Methods and Results:
We first trained a random forest classifier. It achieved good accuracy but very low recall for the positive class ('1', stroke cases). We then addressed the class imbalance by oversampling with SMOTE, which improved recall by 10%.
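To see why accuracy can look good while recall for stroke cases stays low, consider the hypothetical example below. The class counts are invented to mimic an imbalanced dataset, and the "model" simply predicts "no stroke" for everyone; neither reflects the project's actual numbers.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    actual_pos = true_pos + false_neg
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical imbalanced labels: 95 non-stroke cases and 5 stroke cases.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts "no stroke".
y_pred = [0] * 100

acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy = {acc:.2f}, recall = {rec:.2f}")
```

Here the always-negative model is right 95% of the time yet finds none of the stroke cases, which is why recall on the '1' class is the more informative metric for this problem.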
SMOTE:
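The core idea of SMOTE is to create synthetic minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbours. The sketch below is a simplified standard-library stand-in for that idea, not the library routine used in the project. The function name `smote_sketch`, its parameters, and the sample rows are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows from a list of minority-class rows.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority rows, sorted by Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Invented [age, avg_glucose_level] rows for stroke cases, NOT real data.
minority = [[61.0, 202.2], [79.0, 174.1], [81.0, 186.2], [74.0, 70.1]]

new_rows = smote_sketch(minority, n_new=6)
for row in new_rows:
    print([round(v, 1) for v in row])
```

Oversampling of this kind is usually applied to the training split only, so that the test set keeps the original class balance.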
We also tried logistic regression and SVM with various tuning settings and achieved ~10-17% recall.
Logistic regression:
SVM: