By Chetan Sarda, Zixuan Zhu, Yinong Yao, Randeep Singh
This project aims to predict the likelihood of heart attacks based on various health indicators using machine learning techniques. The goal is to assist healthcare professionals in identifying high-risk individuals for preventive measures.
To set up the project, install the required packages:
pip install pandas numpy seaborn matplotlib scikit-learn imbalanced-learn fairlearn shap
The dataset contains health-related features such as age, gender, BMI, smoking status, and physical activity. Before modeling, missing values are handled, categorical variables are encoded, and numerical features are normalized. Dataset link: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease
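The preprocessing steps described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy frame and its column names (`age`, `bmi`, `smoker`) are hypothetical, and median/most-frequent imputation with standard scaling is only one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names; the real dataset's columns may differ.
df = pd.DataFrame({
    "age": [54, 61, np.nan, 47],
    "bmi": [27.1, np.nan, 31.4, 22.8],
    "smoker": ["yes", "no", np.nan, "no"],
})

numeric_cols = ["age", "bmi"]
categorical_cols = ["smoker"]

# Numeric features: fill gaps with the median, then scale to zero mean and unit variance.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: fill gaps with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Wrapping the steps in a `ColumnTransformer` keeps the imputation and scaling statistics tied to the training data, which helps avoid leakage when the same object is later applied to a test split.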
Features are selected using Recursive Feature Elimination with Cross-Validation (RFECV) to identify the most relevant predictors for heart attack risk.
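As a rough sketch of how RFECV works in scikit-learn, the example below runs it on synthetic data. The base estimator (logistic regression), the ROC AUC scoring, and the 5-fold split are assumptions made for illustration; the notebook may use different settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the health data: 10 features, only 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFECV repeatedly drops the weakest feature and uses cross-validation
# to pick the number of features with the best score.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
    min_features_to_select=1,
)
selector.fit(X, y)

print("Features kept:", selector.n_features_)
print("Selection mask:", selector.support_)

# Reduce the feature matrix to the selected columns.
X_selected = selector.transform(X)
```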
Several classifiers are evaluated, including Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machine, Gaussian Naive Bayes, and Random Forest. The models are tuned using hyperparameter optimization to enhance performance.
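One way to compare the listed classifiers is to cross-validate each on the same data, as in the sketch below. It uses synthetic data, default hyperparameters, and F1 as the metric, all of which are assumptions for illustration rather than the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed health data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated F1 score for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```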
The Random Forest Classifier is chosen as the final model based on its balance between accuracy and generalization.
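Tuning a random forest might look like the following sketch. The parameter grid, the F1 scoring, the `class_weight="balanced"` setting, and the synthetic imbalanced data are all illustrative assumptions; the source does not state which search method or grid was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (about 20% positive cases).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search grid; real ranges depend on the data.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training split by default.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Comparing the cross-validated score with the held-out score is one simple check on generalization: a large gap suggests the tuned model is overfitting.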
The model's fairness is assessed using the Fairlearn library to ensure that predictions are equitable across different demographic groups.
Predictions are categorized into different risk levels (Low, Medium, High) to aid in prioritizing medical interventions.
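The mapping from predicted probability to risk level can be sketched as below. The helper `risk_level` and the cutoffs 0.3 and 0.6 are hypothetical; the source does not state the thresholds actually used.

```python
import numpy as np


def risk_level(probabilities, low_cutoff=0.3, high_cutoff=0.6):
    """Map predicted probabilities to "Low", "Medium", or "High".

    The cutoffs are illustrative assumptions, not values from the project.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    labels = np.array(["Low", "Medium", "High"])
    # np.digitize gives 0 below low_cutoff, 1 in [low_cutoff, high_cutoff),
    # and 2 at or above high_cutoff.
    return labels[np.digitize(probabilities, [low_cutoff, high_cutoff])]


levels = risk_level([0.05, 0.30, 0.45, 0.60, 0.92])
print(levels.tolist())  # ['Low', 'Medium', 'Medium', 'High', 'High']
```

With a fitted classifier, the input would typically be the positive-class column of `predict_proba`, for example `model.predict_proba(X)[:, 1]`.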
To run the project, execute the Jupyter notebook that includes data preprocessing, model training, and evaluation.