Context
I've been hired (hypothetically) by the Johns Hopkins Hospital to create a machine learning model that predicts whether a patient is likely to suffer a stroke. Being able to predict this will allow doctors to advise patients and their families not only on how to reduce that risk, but also on how to act in case of an emergency.
The project is based on the Stroke Prediction Dataset on Kaggle, uploaded by the Datasets Grandmaster Fedesoriano.
Project Structure
.
└── Project home                                    # Main directory
    ├── models                                      # Saved models directory
    │   └── deployment                              # Deployment model directory
    ├── src                                         # Source file directory
    │   ├── data                                    # Data directory
    │   │   └── healthcare-dataset-stroke-data.csv  # Dataset in CSV format
    │   └── lib                                     # Code directory
    │       └── helper_functions.py                 # Functions used in the notebook
    ├── 325.ipynb                                   # Assignment
    ├── my.log                                      # Log file for all logged outputs
    ├── project.ipynb                               # Notebook containing the project
    ├── readme.md                                   # This readme file
    └── requirements.txt                            # Requirements file for package installation
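The notebook pulls shared code from `src/lib/helper_functions.py` and writes its logged output to `my.log`. As a minimal sketch of that wiring (the specific calls are illustrative assumptions, not the notebook's exact code), run from the project home directory it could look like this:

```python
import logging
import sys

# Make src/lib importable from the project home directory
sys.path.append("src/lib")

import helper_functions  # shared functions used throughout the notebook

# Send logged output to my.log, the log file listed in the structure above
logging.basicConfig(filename="my.log", level=logging.INFO)
logging.info("Project notebook started")
```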
Usage
Open project.ipynb here on GitHub, or open it in your preferred editor. The necessary modules are installed in the notebook itself, so no further setup is required.
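If you would rather poke at the data outside the notebook, here is a minimal sketch for loading it with pandas, using the path from the project structure above (illustrative, assuming you run it from the project home directory):

```python
import pandas as pd

# Path taken from the project structure above
df = pd.read_csv("src/data/healthcare-dataset-stroke-data.csv")

print(df.shape)   # one row per patient, one column per attribute listed below
print(df.head())  # quick look at the first few records
```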
Table of Contents
- Healthcare: Stroke Prediction
  - Setup
    - Installation and Imports
    - Initial Setup
  - Data Loading and Exploration
    - Data Loading
    - First Exploration
    - Data Cleaning
  - EDA: Exploratory Data Analysis
    - Univariate Analysis
      - Gender
      - Age
      - Hypertension
      - Heart Disease
      - Ever Married
      - Work Type
      - Residence Type
      - Average Glucose Level
      - BMI: Body Mass Index
      - Smoking Status
      - Stroke
    - Multivariate Exploration
      - Gender
      - Age
      - Hypertension
      - Heart Disease
      - Ever Married
      - Work Type
      - Residence Type
      - Average Glucose Level
      - BMI: Body Mass Index
      - Smoking
    - Correlations
      - Univariate Analysis
  - Statistical Analysis
    - Hypothesis 1
  - Machine Learning
    - Data Loading
    - Data Preparation
      - Train-test split
      - Data Preprocessing
    - Model Training and Evaluation
      - Logistic Regression
        - Hyperparameter Optimization
        - Model Evaluation
      - Random Forest
      - Support Vector Machine
      - K-Nearest Neighbors
      - Model Ensembling
      - XGBoost
      - Logistic Regression
        - Optimization
    - Model Deployment
      - Setup
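The Machine Learning entries above outline a fairly standard scikit-learn workflow: a train-test split, preprocessing, and a logistic regression baseline before the other models. A hedged sketch of what such a pipeline could look like is shown below; the column names come from the attribute list further down, while the concrete choices (80/20 split, median imputation, one-hot encoding, `class_weight="balanced"`) are illustrative assumptions rather than the exact code in `project.ipynb`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("src/data/healthcare-dataset-stroke-data.csv")
X = df.drop(columns=["id", "stroke"])
y = df["stroke"]

# Stratify on the target because stroke cases are a small minority
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

numeric = ["age", "avg_glucose_level", "bmi"]
categorical = ["gender", "hypertension", "heart_disease", "ever_married",
               "work_type", "Residence_type", "smoking_status"]

# Impute and scale numeric columns, one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Logistic regression baseline; class_weight counteracts the class imbalance
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000, class_weight="balanced"))])

model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```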
Attributes in the dataset
The attributes in the dataset are listed below; they are of course also covered in the project itself.
- `id`: unique identifier
- `gender`: "Male", "Female" or "Other"
- `age`: age of the patient
- `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- `ever_married`: "No" or "Yes"
- `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- `Residence_type`: "Rural" or "Urban"
- `avg_glucose_level`: average glucose level in blood
- `bmi`: body mass index
- `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- `stroke`: 1 if the patient had a stroke, 0 if not

*Note: "Unknown" in `smoking_status` means that the information is unavailable for this patient.
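For a quick look at how these attributes actually show up in the raw CSV, a short illustrative check (not the notebook's cleaning steps) could be:

```python
import pandas as pd

df = pd.read_csv("src/data/healthcare-dataset-stroke-data.csv")

# Categorical attributes: "Unknown" appears as a regular category in
# smoking_status rather than as a missing value
for col in ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]:
    print(df[col].value_counts(), end="\n\n")

# Binary attributes and the target are stored as 0/1 integers
print(df[["hypertension", "heart_disease", "stroke"]].sum())

# bmi may contain missing entries in the raw file, so check before modeling
print(df["bmi"].isna().sum())
```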