CDC Diabetes Health Indicators
(MLZoomCamp midterm project)

Training decision-tree-based models and deploying the model using Amazon Elastic Beanstalk.

Table of Contents:

  1. 📖 Introduction

  2. 📋 Dataset Information

  3. 📊 EDA and 🧠 Model Training

    1. 🛠️ Virtual Environment Setup
    2. 📓 Run Jupyter Notebook
    3. 💡 Information from EDA and Model Training
    4. 📤 Export Notebook to Python Script
  4. 🧩 Model Deployment

    1. ⚙️ Test Model Deployment
  5. 🐋 Containerization

    1. 🛠️ Create Pipfile and Pipfile.lock
    2. ▶️ Run Docker Container and Test the Service
  6. ☁️ Cloud Deployment

    1. 📋 Prerequisites
    2. ▶️ Run and Test Cloud Deployment
  7. MLZoomCamp Midterm Project General Information

Introduction

📖 This project uses the CDC Diabetes Health Indicators dataset, which can be used to train a model that predicts whether a person is diabetic/pre-diabetic or non-diabetic based on their health records. The dataset was created to better understand the relationship between lifestyle and diabetes in the US, and its creation was funded by the CDC (Centers for Disease Control and Prevention).

Problem: Diabetes is a severe illness that can lead to serious health problems such as heart disease, blindness, kidney failure, and so on. Detecting the illness at an early stage can help to prevent or delay these health issues.

Task: This midterm project aims to build a service that predicts whether a patient is (pre-)diabetic or healthy, using the previously mentioned "CDC Diabetes Health Indicators Dataset".

More information and insights about the dataset can be found in the Dataset Information section.

Dataset Information

📋 Information regarding the dataset:

  • Dataset source and dataset download location
  • Feature variables
  • Target variable

🔗 Dataset page:

The CDC Diabetes Health Indicators dataset is available on the UCI Machine Learning Repository.

Information from the dataset page:

  • Each row of the dataset represents a person participating in the study.
  • The dataset contains 21 feature variables (categorical and integer) and 1 target variable (binary).
  • Cross validation or a fixed train-test split could be used for data splits.
  • The dataset contains sensitive data such as gender, income, and education level.
  • Data preprocessing was performed by bucketing of age. The dataset has no missing values.

Quoted from the dataset page

"The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general along with their diagnosis of diabetes. The 35 features consist of some demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or healthy."

Remark on the quote above:
The quote states that the dataset contains 35 features. However, the dataset page also lists the following summary information:

  • Dataset Characteristics: Tabular, Multivariate
  • Subject Area: Life Science
  • Associated Tasks: Classification
  • Feature Type: Categorical, Integer
  • # Instances: 253680
  • # Features: 21

💡 We will check this discrepancy when digging into the dataset during the EDA (Exploratory Data Analysis).

➡️ From the dataset information we can see that the task will be a binary classification task with 21 features and 1 target variable.

Download is provided via

  1. Python API using the ucimlrepo package.
  2. On the project page (https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators) there is a reference to the dataset source which redirects to Kaggle:
    https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

💡 In this project, we utilize the ucimlrepo Python package to download the initial dataset. To ensure reproducibility, all relevant data for exploratory data analysis (EDA) and training is stored locally in the ./dataset folder. This approach safeguards against potential issues, such as unavailability of or changes to the dataset in the UCI Machine Learning Repository over time.

📥 How the dataset was downloaded and stored locally is described in the EDA notebook notebook.ipynb. The dataset and parts of the metadata are downloaded in notebook.ipynb and stored locally in the ./dataset folder.
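
The download in the notebook roughly follows the pattern below. This is a minimal sketch: the fetch_ucirepo call and the variables.csv file are taken from the description in this README, while the name of the stored CSV file is an assumption.

from pathlib import Path

import pandas as pd
from ucimlrepo import fetch_ucirepo

dataset_dir = Path("./dataset")
dataset_dir.mkdir(exist_ok=True)

# id=891 is the UCI repository id of the CDC Diabetes Health Indicators dataset
cdc = fetch_ucirepo(id=891)

# features and target are returned as pandas DataFrames
df = pd.concat([cdc.data.features, cdc.data.targets], axis=1)
df.to_csv(dataset_dir / "cdc_diabetes_health_indicators.csv", index=False)  # assumed file name

# variable metadata (name, type, description, ...) used for the tables further below
cdc.variables.to_csv(dataset_dir / "variables.csv", index=False)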

📊 During the EDA the generated dataset splits (train, validation, test) were saved to separate files. These will be used later on for running the training on the 'full training' dataset (train + validation) in the final train.py script.
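
The split-and-save step could look roughly like the following sketch. The split ratios, random seed, and file names are assumptions; the actual values are defined in notebook.ipynb.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("./dataset/cdc_diabetes_health_indicators.csv")

# e.g. 60% train, 20% validation, 20% test (ratios assumed for illustration)
df_full_train, df_test = train_test_split(df, test_size=0.20, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train.to_csv("./dataset/train.csv", index=False)
df_val.to_csv("./dataset/validation.csv", index=False)
df_test.to_csv("./dataset/test.csv", index=False)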

EDA and Model Training

📊 EDA stands for Exploratory Data Analysis, which is the process of analyzing a dataset to summarize its main characteristics, often including visualizations of the data. EDA is used to see what the data can tell us.

🧠 Model training is the process of training a machine learning model to make predictions based on data. The model training process includes the following steps:

  1. Data preprocessing
  2. Model training
  3. Model evaluation

📓 The EDA and model training will be performed in the Jupyter Notebook notebook.ipynb. In this notebook, we will perform the following steps:

  1. Exploratory Data Analysis (EDA)
  2. Data preprocessing
  3. Model training
    • Different algorithms (tree-based and linear models)
    • Different hyperparameters
  4. Model evaluation
  5. Model selection

All required steps for setting up the virtual environment to run the notebook are described in the 🛠️ Virtual Environment Setup section.

🐍 For training the final model after the model selection step, we extract the required Python code to a script called train.py. This script will be used to train the final model and save it to disk. The steps covered in the train.py script are:

  1. Data preprocessing
  2. Model training
  3. Model evaluation
  4. Model storage

Virtual Environment Setup

🐍 Setting up the virtual environment for 📊 EDA and 🧠 Model Training using Miniconda with Python 3.10.12. All required packages will be installed from the requirements-eda.txt file, where they are listed with pinned version numbers to ensure reproducibility.

  1. Installing Miniconda

  2. Creating the virtual environment using Python 3.10.12

    conda create --name mlzoomcamp-midterm python=3.10.12
  3. Activating the virtual environment

    conda activate mlzoomcamp-midterm
    • The command prompt should now indicate that the virtual environment is activated and show the name of the virtual environment in parentheses (mlzoomcamp-midterm).
      Within the activated virtual environment (mlzoomcamp-midterm) perform the following steps:

      1. Install the requirements from the requirements-eda.txt
        pip install -r requirements-eda.txt
      2. Start JupyterLab to check if the installation was successful
        jupyter lab

Additional information:
The commands above worked in WSL2 (Windows Subsystem for Linux) on Windows 11 and should be the same on Linux. The conda version installed on my system is 23.0.9 (Conda command reference 23.9.x).
In case you are using a different conda version and the conda commands do not work on your system, check the conda cheat-sheet of your installed conda version for the correct commands.

Run Jupyter Notebook

Information on ▶️ running the 📓 Jupyter Notebook notebook.ipynb.

The previously created virtual environment (mlzoomcamp-midterm) has JupyterLab installed. In order to start JupyterLab, the virtual environment needs to be activated first, as described in the previous section 🛠️ Virtual Environment Setup.

# navigate to project directory, the location and command might differ on your system
cd CDC-Diabetes-Health-Indicators

# activate the virtual environment 'mlzoomcamp-midterm'
conda activate mlzoomcamp-midterm

# within the activated environment indicated by '(mlzoomcamp-midterm)' start JupyterLab 
jupyter lab

Information from EDA and Model Training

💡 Insights and results from the 📊 EDA (exploratory data analysis) and the 🧠 Model Training.

EDA - Variables (Target and Features)

💡 Information about the dataset and its metadata revealed during the 📊 EDA.

The following data was retrieved after downloading the dataset in the EDA notebook notebook.ipynb using the ucimlrepo Python package and storing the 'variables' information to dataset/variables.csv. In order to not duplicate data in multiple files, this information has been documented here and not in the EDA notebook.

ID variable:

  • ID (Integer): Patient ID

Target variable:

  • Diabetes_binary (Binary): 0 = no diabetes; 1 = prediabetes or diabetes

The features are sorted in the list below by their data type:

  • Integer
  • Categorical
  • Binary

Information for binary features (except for feature Sex):

  • 0 = no
  • 1 = yes
Feature variables (name, type, description):

  • BMI (Integer): Body Mass Index
  • MentHlth (Integer): Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? Scale 1-30 days.
  • PhysHlth (Integer): Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? Scale 1-30 days.
  • GenHlth (Integer, categorical): Would you say that in general your health is: scale 1-5; 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor
  • Age (Integer, categorical): 13-level age category (_AGEG5YR, see codebook); 1 = 18-24, 9 = 60-64, 13 = 80 or older
  • Education (Integer, categorical): Education level (EDUCA, see codebook), scale 1-6; 1 = Never attended school or only kindergarten, 2 = Grades 1 through 8 (Elementary), 3 = Grades 9 through 11 (Some high school), 4 = Grade 12 or GED (High school graduate), 5 = College 1 year to 3 years (Some college or technical school), 6 = College 4 years or more (College graduate)
  • Income (Integer, categorical): Income scale (INCOME2, see codebook), scale 1-8; 1 = less than $10,000, 5 = less than $35,000, 8 = $75,000 or more
  • Sex (Binary): 0 = female, 1 = male
  • HighBP (Binary): High blood pressure
  • HighChol (Binary): High cholesterol
  • CholCheck (Binary): Cholesterol check in the past 5 years
  • Smoker (Binary): Have you smoked at least 100 cigarettes in your entire life? (Note: 5 packs = 100 cigarettes)
  • Stroke (Binary): (Ever told) you had a stroke
  • HeartDiseaseorAttack (Binary): Coronary heart disease (CHD) or myocardial infarction (MI)
  • PhysActivity (Binary): Physical activity in the past 30 days, not including job
  • Fruits (Binary): Consume fruit 1 or more times per day
  • Veggies (Binary): Consume vegetables 1 or more times per day
  • HvyAlcoholConsump (Binary): Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
  • AnyHealthcare (Binary): Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc.
  • NoDocbcCost (Binary): Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
  • DiffWalk (Binary): Do you have serious difficulty walking or climbing stairs?

EDA - Missing Values, Duplicates, Imbalances, etc.

💡 Information about the dataset revealed during the 📊 EDA regarding the following points (a short pandas sketch of these checks follows the list):

  • Missing values
    • ✅ As stated in the dataset information, the dataset has no missing values.
  • Duplicates
    • ✅ There are duplicate rows in the dataset when the patient ID is not taken into account. This is because the feature variables are categorical, binary, and integer: the integer features either have small value ranges (1 to 30) or are discretized values whose original values were floating point (feature BMI). Therefore these rows most likely represent different patients that simply share the same feature values.
  • Imbalances
    • ✅ The dataset is highly imbalanced with respect to the target variable:
      • 14% (pre-)diabetic
      • 86% non-diabetic
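
The checks above can be reproduced with a few lines of pandas. This is a sketch; the CSV file name and the presence of an ID column are assumptions.

import pandas as pd

df = pd.read_csv("./dataset/cdc_diabetes_health_indicators.csv")

# missing values: expected to be 0
print(df.isna().sum().sum())

# duplicate rows when ignoring the patient ID (if present)
print(df.drop(columns=["ID"], errors="ignore").duplicated().sum())

# class imbalance of the target variable (~86% vs. ~14%)
print(df["Diabetes_binary"].value_counts(normalize=True))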

The BMI (body mass index) is calculated using the following formula. The result is a floating-point number, but in the dataset the BMI is stored as an integer, i.e. the BMI is rounded to the nearest integer.

$$ BMI_{\text{float}} = \frac{mass_{kg}}{height_{m}^2} $$

$$ BMI_{\text{integer}} = \text{integer}\left(\frac{mass_{kg}}{height_{m}^2} + 0.5 \right) $$
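
For example, with illustrative values:

# a person weighing 80 kg with a height of 1.75 m
mass_kg, height_m = 80.0, 1.75
bmi_float = mass_kg / height_m**2    # 26.12...
bmi_integer = int(bmi_float + 0.5)   # 26, the value as stored in the dataset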

Export Notebook to Python Script

📤 The code for training the final model was exported to the 🐍 Python script train.py. The script covers the following tasks (a condensed sketch of these steps follows the list below):

  • Loading the dataset splits: train, validation, and test
  • Creating a test dataset consisting of the test split
  • Creating a 'full training' dataset consisting of training split and validation split
  • Training the model on the 'full training' (train + validation) dataset
  • Evaluating the model on the test dataset
    • Printing the metrics to the command line
  • Saving the following data to files (bin and json)
    • Model
    • DictVectorizer (fitted on 'full training' dataset)
    • Normalization values (determined on 'full training' dataset in order to normalize the value ranges of some feature variables)
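
As announced above, here is a condensed sketch of that flow. It is illustrative only: the model type, metric, and file names are assumptions, and the normalization of some feature value ranges is omitted for brevity; the actual choices are implemented in train.py.

import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

# 'full training' dataset = train split + validation split
df_full_train = pd.concat([pd.read_csv("./dataset/train.csv"),
                           pd.read_csv("./dataset/validation.csv")])
df_test = pd.read_csv("./dataset/test.csv")

target = "Diabetes_binary"
y_train, y_test = df_full_train[target].values, df_test[target].values

# feature encoding with a DictVectorizer fitted on the 'full training' dataset
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_full_train.drop(columns=[target]).to_dict(orient="records"))
X_test = dv.transform(df_test.drop(columns=[target]).to_dict(orient="records"))

# placeholder for whichever model was selected in the notebook
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# store the model together with the fitted DictVectorizer
with open("model.bin", "wb") as f:
    pickle.dump((dv, model), f)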

🐍 For running the train.py script, make sure the development environment defined in the section 🛠️ Virtual Environment Setup is activated before running the following commands.

# 🐍 Activate the development environment
conda activate mlzoomcamp-midterm

# ▶️ Execute the training script
python train.py

🎲 We will now randomly sample an entry from the test dataset for testing the model later on. For this purpose the script sample_from_test.py is used. The script randomly samples a test dataset entry (row) and stores it as the JSON file test_sample.json (a sketch of the script follows the commands below).

# 🐍 Activate the development environment, if not already done
conda activate mlzoomcamp-midterm

# 🎲 sample randomly without a seed
python sample_from_test.py

# 🎲 sample randomly using a specific seed
python sample_from_test.py --seed 1234
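
Conceptually, sample_from_test.py does something like the following. This is a sketch: the file names match the description above, but the exact implementation is an assumption.

import argparse
import json

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None, help="optional random seed")
args = parser.parse_args()

# draw one random row from the test split (without the target column)
df_test = pd.read_csv("./dataset/test.csv")
sample = df_test.drop(columns=["Diabetes_binary"]).sample(n=1, random_state=args.seed)

# convert numpy scalars to plain Python types before writing the JSON file
row = {key: value.item() if hasattr(value, "item") else value
       for key, value in sample.iloc[0].items()}

with open("test_sample.json", "w") as f:
    json.dump(row, f, indent=2)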

This test_sample.json will be used when testing the model during the next step, the Model Deployment.

Model Deployment

🧩 For the deployment, create a new virtual environment for testing the deployment.

  1. Create the environment using Python 3.10.12

    conda create --name deployment-midterm python=3.10.12
  2. Activate the virtual environment

    conda activate deployment-midterm

    Within the activated virtual environment (deployment-midterm) install the requirements from the requirements-deployment.txt

    pip install -r requirements-deployment.txt

Test Model Deployment

⚙️ Testing the deployment script by starting the predict service requires two terminal windows (a sketch of the predict service follows the steps below).

  1. Terminal window #1: Run the predict service

    # activate 'deployment-midterm', if not already activated 
    conda activate deployment-midterm 
    # start the predict service
    python predict.py
  2. Terminal window #2: Execute the HTTP request using test_predict.py, which will use the sample from test_sample.json

    # activate 'deployment-midterm', if not already activated 
    conda activate deployment-midterm
    # test the predict service using the sample from 'test_sample.json'
    python test_predict.py
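
The predict service started above is a small web service. A minimal Flask sketch of what predict.py may look like is shown below; the route, response fields, and the model.bin file name are assumptions, while port 9696 is the port used in the containerization and cloud sections.

import pickle

from flask import Flask, jsonify, request

# load the fitted DictVectorizer and the model (assumed to be pickled together)
with open("model.bin", "rb") as f:
    dv, model = pickle.load(f)

app = Flask("diabetes-prediction")

@app.route("/predict", methods=["POST"])
def predict():
    patient = request.get_json()
    X = dv.transform([patient])
    probability = float(model.predict_proba(X)[0, 1])
    return jsonify({
        "diabetes_probability": probability,
        "diabetes": bool(probability >= 0.5),
    })

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=9696)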

Containerization

πŸ‹ Putting the prediction service in a Docker container, which requires Docker being installed on your system.

Create Pipfile and Pipfile.lock

🛠️ Create a Pipfile and Pipfile.lock for containerization using pipenv.

  1. Install pipenv
    pip install pipenv==2023.10.24
  2. Create a Pipfile and Pipfile.lock based on the provided requirements-deployment.txt
    pipenv install -r requirements-deployment.txt

Run Docker Container and Test the Service

The Docker image ai2ys/mlzoomcamp-midterm-project:0.0.0 has been pushed to the 🐋 DockerHub registry (docker pull ai2ys/mlzoomcamp-midterm-project:0.0.0). Therefore you can run the container without building it first; running the container will pull the image from DockerHub.

  1. Optional: Building the Docker image ai2ys/mlzoomcamp-midterm-project:0.0.0

    docker build -t ai2ys/mlzoomcamp-midterm-project:0.0.0 .
  2. Running the Docker container (terminal windows #1)

    docker run --rm -p 9696:9696 ai2ys/mlzoomcamp-midterm-project:0.0.0
  3. Testing the prediction service in the Docker container from the virtual environment (deployment-midterm).
    Open a new terminal window and execute the following commands (terminal window #2)

    # activate the virtual environment
    conda activate deployment-midterm
    # run the test script
    python test_predict.py	

Cloud Deployment

☁️ Instructions for the cloud deployment using AWS Elastic Beanstalk.

Prerequisites

📋 The following are required for the steps below:

  • An AWS account with credentials configured locally (referenced via --profile <profile>)
  • A virtual environment named awsebcli with the AWS Elastic Beanstalk CLI (awsebcli package) installed
  • The deployment-midterm virtual environment from the Model Deployment section, used for running test_predict.py
  • Docker installed (see Containerization)

Run and Test Cloud Deployment

🎞️ Video of the cloud deployment showing all steps below: 🔗 https://youtu.be/eu-TP17kvwc

▢️ Steps for creating and running the prediction service on AWS Elastic Beanstalk.

  • Initialize AWS Elastic Beanstalk project

    # activate the virtual environment
    conda activate awsebcli
    # initialize eb, select region, specify credentials
    eb init -p "Docker running on 64bit Amazon Linux 2023" -r eu-west-1 --profile <profile> mlzoomcamp-midterm-project
  • Test locally using Elastic Beanstalk

    1. Terminal window #1: Using AWS Elastic Beanstalk to run the service locally
      # activate the virtual environment
      conda activate awsebcli
      eb local run --port 9696
    2. Terminal window #2: Run the test_predict.py script
      # activate the virtual environment 'deployment-midterm'
      conda activate deployment-midterm
      python test_predict.py
  • Test cloud deployment using Elastic Beanstalk

    1. Terminal window #1: Create the Elastic Beanstalk environment

      # activate the virtual environment 'awsebcli'
      conda activate awsebcli
      eb create mlzoomcamp-midterm-env

      When the service is running, copy the URL to the clipboard 📋

    2. Terminal window #2: Run the test_predict.py script in another terminal and pass the URL copied above (see the sketch of test_predict.py at the end of this section)

      # activate the virtual environment 'deployment-midterm'
      conda activate deployment-midterm
      python test_predict.py --url <elastic beanstalk url>
  • When we are done running the prediction service on AWS Elastic Beanstalk, terminate the environment

    # activate the virtual environment 'awsebcli', if not already activated
    conda activate awsebcli
    eb terminate mlzoomcamp-midterm-env
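
For reference, the test_predict.py script used in the local, Docker, and cloud tests could look roughly like the sketch below; the default URL, payload handling, and output format are assumptions.

import argparse
import json

import requests

parser = argparse.ArgumentParser()
parser.add_argument("--url", default="http://localhost:9696/predict",
                    help="URL of the prediction service")
args = parser.parse_args()

# load the patient sampled from the test split by sample_from_test.py
with open("test_sample.json") as f:
    patient = json.load(f)

response = requests.post(args.url, json=patient, timeout=10)
print(response.json())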

MLZoomCamp Midterm Project General Information

General information about the MLZoomCamp Midterm Project can be found here: https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/projects#midterm-project

Information for cohort 2023 can be found here: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/projects.md#midterm-project