Context
I've been hired (hypothetically) by the Johns Hopkins Hospital to create a machine learning model that predicts whether a patient is likely to suffer a stroke. Being able to predict this will allow doctors to advise patients and their families not only on how to reduce that risk, but also on how to act in case of an emergency.
The project is based on the Stroke Prediction Dataset on Kaggle, uploaded by the Datasets Grandmaster Fedesoriano.
Project Structure
.
└── Project home                                    # Main directory
    ├── models                                      # Saved models directory
    │   └── deployment                              # Deployment model directory
    ├── src                                         # Source file directory
    │   ├── data                                    # Data directory
    │   │   └── healthcare-dataset-stroke-data.csv  # Dataset in CSV format
    │   └── lib                                     # Code directory
    │       └── helper_functions.py                 # Functions used in the notebook
    ├── 325.ipynb                                   # Assignment
    ├── my.log                                      # Log file for all logged outputs
    ├── project.ipynb                               # Notebook containing the project
    ├── readme.md                                   # This readme file
    └── requirements.txt                            # Requirements file for package installation
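The notebook pulls shared code from `src/lib/helper_functions.py` and writes its logged output to `my.log`. As a minimal sketch of that wiring (the specific calls are illustrative assumptions, not the notebook's exact code), run from the project home directory it could look like this:

```python
import logging
import sys

# Make src/lib importable from the project home directory
sys.path.append("src/lib")

import helper_functions  # shared functions used throughout the notebook

# Send logged output to my.log, the log file listed in the structure above
logging.basicConfig(filename="my.log", level=logging.INFO)
logging.info("Project notebook started")
```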
Usage
Open project.ipynb here on GitHub, or open it in your preferred editor. The necessary modules are installed in the notebook itself, so no further setup is required.
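If you would rather poke at the data outside the notebook, here is a minimal sketch for loading it with pandas, using the path from the project structure above (illustrative, assuming you run it from the project home directory):

```python
import pandas as pd

# Path taken from the project structure above
df = pd.read_csv("src/data/healthcare-dataset-stroke-data.csv")

print(df.shape)   # one row per patient, one column per attribute listed below
print(df.head())  # quick look at the first few records
```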
Table of Contents
- Healthcare: Stroke Prediction
  - Setup
    - Installation and Imports
    - Initial Setup
  - Data Loading and Exploration
    - Data Loading
    - First Exploration
    - Data Cleaning
  - EDA: Exploratory Data Analysis
    - Univariate Analysis
      - Gender
      - Age
      - Hypertension
      - Heart Disease
      - Ever Married
      - Work Type
      - Residence Type
      - Average Glucose Level
      - BMI: Body Mass Index
      - Smoking Status
      - Stroke
    - Multivariate Exploration
      - Gender
      - Age
      - Hypertension
      - Heart Disease
      - Ever Married
      - Work Type
      - Residence Type
      - Average Glucose Level
      - BMI: Body Mass Index
      - Smoking
    - Correlations
      - Univariate Analysis
  - Statistical Analysis
    - Hypothesis 1
  - Machine Learning
    - Data Loading
    - Data Preparation
      - Train-test split
      - Data Preprocessing
    - Model Training and Evaluation
      - Logistic Regression
        - Hyperparameter Optimization
        - Model Evaluation
      - Random Forest
      - Support Vector Machine
      - K-Nearest Neighbors
      - Model Ensembling
      - XGBoost
      - Logistic Regression
        - Optimization
    - Model Deployment
      - Setup
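The Machine Learning entries above outline a fairly standard scikit-learn workflow: a train-test split, preprocessing, and a logistic regression baseline before the other models. A hedged sketch of what such a pipeline could look like is shown below; the column names come from the attribute list further down, while the concrete choices (80/20 split, median imputation, one-hot encoding, `class_weight="balanced"`) are illustrative assumptions rather than the exact code in `project.ipynb`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("src/data/healthcare-dataset-stroke-data.csv")
X = df.drop(columns=["id", "stroke"])
y = df["stroke"]

# Stratify on the target because stroke cases are a small minority
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

numeric = ["age", "avg_glucose_level", "bmi"]
categorical = ["gender", "hypertension", "heart_disease", "ever_married",
               "work_type", "Residence_type", "smoking_status"]

# Impute and scale numeric columns, one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Logistic regression baseline; class_weight counteracts the class imbalance
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000, class_weight="balanced"))])

model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```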
Attributes in the dataset
The attributes in the dataset are listed below; they are of course also covered in the project itself.
- `id`: unique identifier
- `gender`: "Male", "Female" or "Other"
- `age`: age of the patient
- `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- `ever_married`: "No" or "Yes"
- `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- `Residence_type`: "Rural" or "Urban"
- `avg_glucose_level`: average glucose level in blood
- `bmi`: body mass index
- `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- `stroke`: 1 if the patient had a stroke, 0 if not

*Note: "Unknown" in `smoking_status` means that the information is unavailable for this patient.
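For a quick look at how these attributes actually show up in the raw CSV, a short illustrative check (not the notebook's cleaning steps) could be:

```python
import pandas as pd

df = pd.read_csv("src/data/healthcare-dataset-stroke-data.csv")

# Categorical attributes: "Unknown" appears as a regular category in
# smoking_status rather than as a missing value
for col in ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]:
    print(df[col].value_counts(), end="\n\n")

# Binary attributes and the target are stored as 0/1 integers
print(df[["hypertension", "heart_disease", "stroke"]].sum())

# bmi may contain missing entries in the raw file, so check before modeling
print(df["bmi"].isna().sum())
```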