Table of Contents:
This project uses the CDC Diabetes Health Indicators dataset, which can be used to train a model that predicts whether a person is diabetic/pre-diabetic or non-diabetic based on their health records. The dataset was created to better understand the relationship between lifestyle and diabetes in the US, and its creation was funded by the CDC (Centers for Disease Control and Prevention).
Problem: Diabetes is a severe illness that can lead to serious health problems such as heart disease, blindness, kidney failure, and so on. Detecting the illness at an early stage can help prevent or delay these health issues.
Task: This midterm project aims to build a service that predicts whether a patient is (pre-)diabetic or healthy using the previously mentioned data provided by the "CDC Diabetes Health Indicators Dataset".
More information and insights about the dataset can be found in the Dataset Information section.
Information regarding the dataset:
- Dataset source and download location
- Feature variables
- Target variable
Dataset page:
The CDC Diabetes Health Indicators dataset is available on the UCI Machine Learning Repository.
Information from the dataset page:
- Each row of the dataset represents a person participating in the study.
- The dataset contains 21 feature variables (categorical and integer) and 1 target variable (binary).
- Cross validation or a fixed train-test split could be used for data splits.
- The dataset contains sensitive data such as gender, income, and education level.
- Data preprocessing included bucketing the age variable. The dataset has no missing values.
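As a hedged illustration of that age bucketing, the helper below maps an age in years to a 13-level category following the BRFSS `_AGEG5YR` scheme referenced later in the feature table (the function name and exact bin edges are my reading of the codebook, not code from this project):

```python
def age_bucket(age: int) -> int:
    """Map an age in years to the 13-level _AGEG5YR-style category
    used in the dataset (1 = 18-24, ..., 13 = 80 or older)."""
    if age < 18:
        raise ValueError("the survey covers adults (18+) only")
    if age <= 24:
        return 1
    if age >= 80:
        return 13
    # categories 2..12 cover 5-year bins starting at age 25
    return 2 + (age - 25) // 5

print(age_bucket(21), age_bucket(62), age_bucket(85))  # -> 1 9 13
```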
Quoted from the dataset page:
"The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general along with their diagnosis of diabetes. The 35 features consist of some demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or healthy."
Remark on the quote above:

The quote states that the dataset contains 35 features. However, the dataset page further states:

| Dataset Characteristics | Subject Area | Associated Tasks | Feature Type | # Instances | # Features |
|---|---|---|---|---|---|
| Tabular, Multivariate | Life Science | Classification | Categorical, Integer | 253680 | 21 |
💡 We will check this discrepancy when digging into the dataset during the EDA (Exploratory Data Analysis).
➡️ From the dataset information we can see that the task will be a binary classification task with 21 features and 1 target variable.
Download is provided via:

- The Python API using the `ucimlrepo` package.
  - When using the `ucimlrepo` package to download the data, its `metadata.data_url` field provides an additional download link from which the dataset CSV file can be accessed directly: https://archive.ics.uci.edu/static/public/891/data.csv
- The project page (https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators), which references the dataset source and redirects to Kaggle: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
💡 In this project, we utilize the `ucimlrepo` Python package to download the initial dataset. To ensure reproducibility, all relevant data for exploratory data analysis (EDA) and training is stored locally in the `./dataset` folder. This approach safeguards against potential issues, such as unavailability of or changes to the dataset in the UCI Machine Learning Repository over time.

📥 How the dataset was downloaded and stored locally is described in the EDA notebook `notebook.ipynb`. The dataset and parts of the metadata are downloaded in `notebook.ipynb` and stored locally in the `./dataset` folder:
- dataframe: `./dataset/data.csv`
- information about variables: `./dataset/variables.csv`
- metadata (only some parts of it): `./dataset/metadata_partially.json`
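The local-caching idea can be sketched as follows; `load_dataset` and `fetch_rows` are illustrative names, with `fetch_rows` standing in for the actual `ucimlrepo` download:

```python
import csv
import os

def load_dataset(path, fetch_rows):
    """Return dataset rows from a local CSV cache.

    Downloads via fetch_rows() (e.g. a ucimlrepo call) only if the
    cache file does not exist yet, so later runs stay reproducible
    even if the remote dataset changes or becomes unavailable.
    """
    if not os.path.exists(path):
        rows = fetch_rows()
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
    with open(path, newline="") as f:
        return list(csv.reader(f))
```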
During the EDA the generated dataset splits (train, validation, test) were saved to separate files. These will be used later on for running the training on the 'full training' dataset (train + validation) in the final `train.py` script.
EDA stands for Exploratory Data Analysis: the process of analyzing a dataset to summarize its main characteristics, often using visualizations. EDA helps us see what the data can tell us before modeling.
🧠 Model training is the process of fitting a machine learning model so that it can make predictions on data. The model training process includes the following steps:
- Data preprocessing
- Model training
- Model evaluation
The EDA and model training will be performed in a Jupyter notebook, `notebook.ipynb`. In this notebook, we will perform the following steps:
- Exploratory Data Analysis (EDA)
- Data preprocessing
- Model training
- Different algorithms (tree-based and linear models)
- Different hyperparameters
- Model evaluation
- Model selection
All required steps for setting up the virtual environment to run the notebook are described in the 🛠️ Virtual Environment Setup section.
For training the final model after the model selection step, we extract the required Python code into a script called `train.py`. This script will be used to train the final model and save it to disk. Steps covered in the `train.py` script:
- Data preprocessing
- Model training
- Model evaluation
- Model storage
🛠️ Setting up the virtual environment for EDA and 🧠 Model Training using Miniconda with Python `3.10.12`. All required packages will be installed from the `requirements-eda.txt` file, where the packages are pinned to specific versions to ensure reproducibility.
1. Installing Miniconda

   - Follow the installation instructions for your host OS: https://docs.conda.io/projects/miniconda/en/latest/

2. Creating the virtual environment using Python 3.10.12

   ```shell
   conda create --name mlzoomcamp-midterm python=3.10.12
   ```

3. Activating the virtual environment

   ```shell
   conda activate mlzoomcamp-midterm
   ```

   The command prompt should now indicate that the virtual environment is activated by showing its name in parentheses: `(mlzoomcamp-midterm)`.

4. Within the activated virtual environment `(mlzoomcamp-midterm)` perform the following steps:

   - Install the requirements from `requirements-eda.txt`:

     ```shell
     pip install -r requirements-eda.txt
     ```

   - Start JupyterLab to check that the installation was successful:

     ```shell
     jupyter lab
     ```
5. Additional information:

   The commands above worked in WSL2 (Windows Subsystem for Linux) on Windows 11 and should be the same on Linux. The `conda` version installed on my system is `23.0.9` (Conda command reference `23.9.x`). In case you are using a different `conda` version and the `conda` commands do not work on your system, check the `conda` cheat sheet of your installed version for the correct commands.
Information on `notebook.ipynb`.

The previously created virtual environment `(mlzoomcamp-midterm)` has JupyterLab installed. In order to start JupyterLab, the virtual environment needs to be activated first. Activate the virtual environment that we created in the section 🛠️ Virtual Environment Setup.

```shell
# navigate to the project directory; the location and command might differ on your system
cd CDC-Diabetes-Health-Indicators
# activate the virtual environment 'mlzoomcamp-midterm'
conda activate mlzoomcamp-midterm
# within the activated environment, indicated by '(mlzoomcamp-midterm)', start JupyterLab
jupyter lab
```
💡 Insights and Results from the EDA (exploratory data analysis) and the 🧠 Model Training.

💡 Information revealed during the EDA from the dataset and its metadata.
The following data was retrieved after downloading the dataset in the EDA notebook `notebook.ipynb` using the `ucimlrepo` Python package and storing the 'variables' information to `dataset/variables.csv`. In order not to duplicate data in multiple files, the following data has been stored here and not in the EDA notebook.
| ID variable | Type | Description |
|---|---|---|
| ID | Integer | Patient ID |

| Target variable | Type | Description |
|---|---|---|
| Diabetes_binary | Binary | 0 = no diabetes, 1 = prediabetes or diabetes |
The features are sorted in the table below by their data type:

- Integer
- Categorical
- Binary

Information for binary features (except for feature `Sex`): `0` = no, `1` = yes
| Feature | Type | Description |
|---|---|---|
BMI | Integer | Body Mass Index |
MentHlth | Integer | Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? scale 1-30 days |
PhysHlth | Integer | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? scale 1-30 days |
GenHlth | Integer (Categorical) | Would you say that in general your health is: scale 1-5 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor |
Age | Integer (Categorical) | Age, 13-level age category (_AGEG5YR see codebook) 1 = 18-24, 9 = 60-64, 13 = 80 or older |
Education | Integer (Categorical) | Education level (EDUCA see codebook) scale 1-6 1 = Never attended school or only kindergarten 2 = Grades 1 through 8 (Elementary) 3 = Grades 9 through 11 (Some high school) 4 = Grade 12 or GED (High school graduate) 5 = College 1 year to 3 years (Some college or technical school) 6 = College 4 years or more (College graduate) |
Income | Integer (Categorical) | Income scale (INCOME2 see codebook) scale 1-8 1 = less than $10,000 5 = less than $35,000 8 = $75,000 or more |
Sex | Binary | Sex, 0 = female 1 = male |
HighBP | Binary | High blood pressure |
HighChol | Binary | High cholesterol |
CholCheck | Binary | Cholesterol check in 5 years |
Smoker | Binary | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] |
Stroke | Binary | (Ever told) you had a stroke. |
HeartDiseaseorAttack | Binary | Coronary heart disease (CHD) or myocardial infarction (MI) |
PhysActivity | Binary | Physical activity in past 30 days - not including job |
Fruits | Binary | Consume Fruit 1 or more times per day |
Veggies | Binary | Consume Vegetables 1 or more times per day |
HvyAlcoholConsump | Binary | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) |
AnyHealthcare | Binary | Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. |
NoDocbcCost | Binary | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? |
DiffWalk | Binary | Do you have serious difficulty walking or climbing stairs? |
💡 Information revealed about the dataset during the EDA regarding:
- Missing values
  - ✅ As stated in the dataset information, the dataset has no missing values.
- Duplicates
  - ✅ There are duplicate rows in the dataset when the patient ID is not taken into account. This is because the feature variables are categorical, binary, and integer: the integer values either have value ranges between 1 and 30 or are discrete values although the original values were floating point (feature `BMI`). Therefore these rows represent different patients who simply share the same feature values.
- Imbalances
  - ✅ The dataset is highly imbalanced with respect to the target variable:
    - 14% (pre-)diabetic
    - 86% non-diabetic
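The class fractions above can be recomputed from the target column alone; a minimal pure-Python sketch (the label list is a toy stand-in for `Diabetes_binary`):

```python
from collections import Counter

def class_fractions(labels):
    """Return the fraction of each class label, e.g. {0: 0.86, 1: 0.14}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# toy example: 86 negatives, 14 positives
labels = [0] * 86 + [1] * 14
print(class_fractions(labels))  # -> {0: 0.86, 1: 0.14}
```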
The BMI (body mass index) is calculated using the following formula. The result is a floating-point number, but in the dataset the BMI is stored as an integer, i.e. rounded to a whole number.

BMI = weight [kg] / (height [m])²
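A small sketch of that calculation (assuming "rounded" means rounded to the nearest integer):

```python
def bmi(weight_kg: float, height_m: float) -> int:
    """Body Mass Index = weight [kg] / height [m]^2,
    stored as an integer as in the dataset."""
    return round(weight_kg / height_m ** 2)

print(bmi(70, 1.75))  # -> 23 (from 22.857...)
```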
🤖 The code for training the final model was exported to the Python script `train.py`. The script covers the following tasks:
- Loading the dataset splits: train, validation, and test
- Creating a test dataset consisting of the test split
- Creating a 'full training' dataset consisting of training split and validation split
- Training the model on the 'full training' (train + validation) dataset
- Evaluating the model on the test dataset
- Printing the metrics to the command line
- Saving the following data to files (bin and json)
- Model
- DictVectorizer (fitted on 'full training' dataset)
- Normalization values (determined on 'full training' dataset in order to normalize the value ranges of some feature variables)
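The saving/loading of these artifacts might look roughly like the sketch below. This is not the actual `train.py`; the file names and the min/max structure of the normalization values are assumptions for illustration:

```python
import json
import pickle

def save_artifacts(model, dict_vectorizer, norm_values,
                   model_path="model.bin", norm_path="normalization.json"):
    """Persist the model and DictVectorizer (pickled together) and the
    normalization values (JSON) determined on the 'full training' set."""
    with open(model_path, "wb") as f:
        pickle.dump((dict_vectorizer, model), f)
    with open(norm_path, "w") as f:
        json.dump(norm_values, f, indent=2)

def load_artifacts(model_path="model.bin", norm_path="normalization.json"):
    """Reload the artifacts for use in the prediction service."""
    with open(model_path, "rb") as f:
        dict_vectorizer, model = pickle.load(f)
    with open(norm_path) as f:
        norm_values = json.load(f)
    return model, dict_vectorizer, norm_values
```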
For running the `train.py` script, make sure the development environment defined in section 🛠️ Virtual Environment Setup is activated before running the following commands.

```shell
# activate the development environment
conda activate mlzoomcamp-midterm
# ▶️ execute the training script
python train.py
```
🎲 We will now randomly sample an entry from the test dataset for testing the model later on. For this purpose the script `sample_from_test.py` is used. It randomly samples one test dataset entry (row) and stores it as a JSON file, `test_sample.json`.
```shell
# activate the development environment, if not already done
conda activate mlzoomcamp-midterm
# 🎲 sample randomly without a seed
python sample_from_test.py
# 🎲 sample randomly using a specific seed
python sample_from_test.py --seed 1234
```
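Conceptually, the sampling works like the sketch below (not the actual `sample_from_test.py`; the rows are illustrative):

```python
import json
import random

def sample_row(rows: list, seed=None) -> dict:
    """Pick one row at random; a fixed seed makes the pick reproducible."""
    rng = random.Random(seed)
    return rng.choice(rows)

# toy rows standing in for the test dataset
rows = [{"BMI": 28, "HighBP": 1}, {"BMI": 22, "HighBP": 0}]
sample = sample_row(rows, seed=1234)

# store the sampled row like the script does
with open("test_sample.json", "w") as f:
    json.dump(sample, f)
```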
This `test_sample.json` will be used when testing the model during the next step, the Model Deployment.
🧩 For deployment, create a new virtual environment for testing the deployment.

1. Create the environment using Python 3.10.12

   ```shell
   conda create --name deployment-midterm python=3.10.12
   ```

2. Activate the virtual environment

   ```shell
   conda activate deployment-midterm
   ```

3. Within the activated virtual environment `(deployment-midterm)` install the requirements from `requirements-deployment.txt`

   ```shell
   pip install -r requirements-deployment.txt
   ```
Testing the deployment by starting the predict service requires two terminal windows.

1. Terminal window #1: Run the predict service

   ```shell
   # activate 'deployment-midterm', if not already activated
   conda activate deployment-midterm
   # start the predict service
   python predict.py
   ```

2. Terminal window #2: Execute the HTTP request using `test_predict.py`, which will use the sample from `test_sample.json`

   ```shell
   # activate 'deployment-midterm', if not already activated
   conda activate deployment-midterm
   # test the predict service using the sample from 'test_sample.json'
   python test_predict.py
   ```
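The HTTP request that `test_predict.py` sends can be sketched with the standard library alone; the `/predict` endpoint path and port are assumptions about the service:

```python
import json
import urllib.request

def build_request(url: str, sample: dict) -> urllib.request.Request:
    """Build the POST request carrying one patient record as JSON."""
    return urllib.request.Request(
        url,
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(url: str, sample: dict) -> dict:
    """Send the request and decode the service's JSON response."""
    with urllib.request.urlopen(build_request(url, sample)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# usage, with the predict service from terminal window #1 running:
#   sample = json.load(open("test_sample.json"))
#   print(predict("http://localhost:9696/predict", sample))
```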
Putting the prediction service in a Docker container requires Docker to be installed on your system.
🛠️ Create a `Pipfile` and `Pipfile.lock` for containerization using `pipenv`

1. Install `pipenv`

   ```shell
   pip install pipenv==2023.10.24
   ```

2. Create a `Pipfile` and `Pipfile.lock` based on the provided `requirements-deployment.txt`

   ```shell
   pipenv install -r requirements-deployment.txt
   ```
The Docker image `ai2ys/mlzoomcamp-midterm-project:0.0.0` has been pushed to the DockerHub registry. Therefore you can run the container without building it first; running the container will pull the image from DockerHub.

```shell
docker pull ai2ys/mlzoomcamp-midterm-project:0.0.0
```
1. Optional: Building the Docker image `ai2ys/mlzoomcamp-midterm-project:0.0.0`

   ```shell
   docker build -t ai2ys/mlzoomcamp-midterm-project:0.0.0 .
   ```

2. Running the Docker container (terminal window #1)

   ```shell
   docker run --rm -p 9696:9696 ai2ys/mlzoomcamp-midterm-project:0.0.0
   ```

3. Testing the prediction service in the Docker container from the virtual environment `(deployment-midterm)`.

   Open a new terminal window and execute the following commands (terminal window #2):

   ```shell
   # activate the virtual environment
   conda activate deployment-midterm
   # run the test script
   python test_predict.py
   ```
Instructions for the cloud deployment using AWS Elastic Beanstalk.

1. Amazon AWS Account

   For this task an AWS account is required. Please create an AWS account following the instructions from Machine Learning Bookcamp - Creating an AWS Account.

2. Installing the EB CLI (Elastic Beanstalk command line interface)

   ```shell
   # create a virtual environment for AWS Elastic Beanstalk
   conda create --name awsebcli python=3.10.12
   # activate the virtual environment
   conda activate awsebcli
   # install the AWS Elastic Beanstalk CLI into the active environment
   pip install awsebcli==3.20.10
   # check the AWS Elastic Beanstalk CLI version
   eb --version
   ```
Video of the cloud deployment showing all steps below: https://youtu.be/eu-TP17kvwc
1. Initialize the AWS Elastic Beanstalk project

   ```shell
   # activate the virtual environment
   conda activate awsebcli
   # initialize eb: select region, specify credentials
   eb init -p "Docker running on 64bit Amazon Linux 2023" -r eu-west-1 --profile <profile> mlzoomcamp-midterm-project
   ```

2. Test locally using Elastic Beanstalk

   - Terminal window #1: Use AWS Elastic Beanstalk to run the service locally

     ```shell
     # activate the virtual environment
     conda activate awsebcli
     eb local run --port 9696
     ```

   - Terminal window #2: Run the `test_predict.py` script

     ```shell
     # activate the virtual environment 'deployment-midterm'
     conda activate deployment-midterm
     python test_predict.py
     ```

3. Test the cloud deployment using Elastic Beanstalk

   - Terminal window #1: Create the Elastic Beanstalk environment

     ```shell
     # activate the virtual environment 'awsebcli'
     conda activate awsebcli
     eb create mlzoomcamp-midterm-env
     ```

     When the service is running, copy its URL to the clipboard.

   - Terminal window #2: Run the `test_predict.py` script and pass the URL

     ```shell
     # activate the virtual environment 'deployment-midterm'
     conda activate deployment-midterm
     python test_predict.py --url <elastic beanstalk url>
     ```

4. When we are done running the prediction service on AWS Elastic Beanstalk, terminate the environment

   ```shell
   # activate the virtual environment, if not already active
   conda activate awsebcli
   eb terminate mlzoomcamp-midterm-env
   ```
General information about the MLZoomCamp Midterm Project can be found here: https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/projects#midterm-project
Information for cohort 2023 can be found here: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/projects.md#midterm-project