Table of Contents:
This project uses the CDC Diabetes Health Indicators dataset, which can be used to train a model that predicts whether a person is diabetic/pre-diabetic or non-diabetic based on their health records. The dataset was created to better understand the relationship between lifestyle and diabetes in the US, and its creation was funded by the CDC (Centers for Disease Control and Prevention).
Problem: Diabetes is a severe illness that can lead to serious health problems such as heart disease, blindness, kidney failure, and so on. Detecting the illness at an early stage can help prevent or delay these health issues.
Task: This midterm project aims to build a service that predicts whether a patient is (pre-)diabetic or healthy using the previously mentioned data provided by the "CDC Diabetes Health Indicators Dataset".
More information and insights about the dataset can be found in the Dataset Information section.
Information regarding the dataset:
- Dataset source and download location
- Feature variables
- Target variable
Dataset page:
The CDC Diabetes Health Indicators dataset is available on the UCI Machine Learning Repository.
Information from the dataset page:
- Each row of the dataset represents a person participating in the study.
- The dataset contains 21 feature variables (categorical and integer) and 1 target variable (binary).
- Cross validation or a fixed train-test split could be used for data splits.
- The dataset contains sensitive data such as gender, income, and education level.
- Data preprocessing included bucketing the age variable. The dataset has no missing values.
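As a hedged illustration of that age bucketing, the helper below maps an age in years to a 13-level category following the BRFSS `_AGEG5YR` scheme referenced later in the feature table (the function name and exact bin edges are my reading of the codebook, not code from this project):

```python
def age_bucket(age: int) -> int:
    """Map an age in years to the 13-level _AGEG5YR-style category
    used in the dataset (1 = 18-24, ..., 13 = 80 or older)."""
    if age < 18:
        raise ValueError("the survey covers adults (18+) only")
    if age <= 24:
        return 1
    if age >= 80:
        return 13
    # categories 2..12 cover 5-year bins starting at age 25
    return 2 + (age - 25) // 5

print(age_bucket(21), age_bucket(62), age_bucket(85))  # -> 1 9 13
```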
Quoted from the dataset page:
"The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general along with their diagnosis of diabetes. The 35 features consist of some demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or healthy."
Remark on the quote above:

The quote states that the dataset contains 35 features. However, the dataset page further states:

| Dataset Characteristics | Subject Area | Associated Tasks | Feature Type | # Instances | # Features |
|---|---|---|---|---|---|
| Tabular, Multivariate | Life Science | Classification | Categorical, Integer | 253680 | 21 |
💡 We will check this discrepancy when digging into the dataset during the EDA (Exploratory Data Analysis).
➡️ From the dataset information we can see that the task will be a binary classification task with 21 features and 1 target variable.
Download is provided via:

- The Python API using the `ucimlrepo` package.
  - When using the `ucimlrepo` package to download the data, its `metadata.data_url` field provides an additional download link from which the dataset CSV file can be accessed directly: https://archive.ics.uci.edu/static/public/891/data.csv
- The project page (https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators), which references the dataset source and redirects to Kaggle: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
💡 In this project, we utilize the `ucimlrepo` Python package to download the initial dataset. To ensure reproducibility, all relevant data for exploratory data analysis (EDA) and training is stored locally in the `./dataset` folder. This approach safeguards against potential issues, such as unavailability of or changes to the dataset in the UCI Machine Learning Repository over time.

📥 How the dataset was downloaded and stored locally is described in the EDA notebook `notebook.ipynb`. The dataset and parts of the metadata are downloaded in `notebook.ipynb` and stored locally in the `./dataset` folder:
- dataframe: `./dataset/data.csv`
- information about variables: `./dataset/variables.csv`
- metadata (only some parts of it): `./dataset/metadata_partially.json`
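The local-caching idea can be sketched as follows; `load_dataset` and `fetch_rows` are illustrative names, with `fetch_rows` standing in for the actual `ucimlrepo` download:

```python
import csv
import os

def load_dataset(path, fetch_rows):
    """Return dataset rows from a local CSV cache.

    Downloads via fetch_rows() (e.g. a ucimlrepo call) only if the
    cache file does not exist yet, so later runs stay reproducible
    even if the remote dataset changes or becomes unavailable.
    """
    if not os.path.exists(path):
        rows = fetch_rows()
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
    with open(path, newline="") as f:
        return list(csv.reader(f))
```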
During the EDA the generated dataset splits (train, validation, test) were saved to separate files. These will be used later on for running the training on the 'full training' dataset (train + validation) in the final `train.py` script.
EDA stands for Exploratory Data Analysis: the process of analyzing a dataset to summarize its main characteristics, often using visualizations. EDA helps us see what the data can tell us before modeling.
🧠 Model training is the process of fitting a machine learning model so that it can make predictions on data. The model training process includes the following steps:
- Data preprocessing
- Model training
- Model evaluation
The EDA and model training will be performed in a Jupyter notebook, `notebook.ipynb`. In this notebook, we will perform the following steps:
- Exploratory Data Analysis (EDA)
- Data preprocessing
- Model training
- Different algorithms (tree-based and linear models)
- Different hyperparameters
- Model evaluation
- Model selection
All required steps for setting up the virtual environment to run the notebook are described in the 🛠️ Virtual Environment Setup section.
For training the final model after the model selection step, we extract the required Python code into a script called `train.py`. This script will be used to train the final model and save it to disk. Steps covered in the `train.py` script:
- Data preprocessing
- Model training
- Model evaluation
- Model storage
🛠️ Setting up the virtual environment for EDA and 🧠 Model Training using Miniconda with Python `3.10.12`. All required packages will be installed from the `requirements-eda.txt` file, where the packages are pinned to specific versions to ensure reproducibility.
1. Installing Miniconda

   - Follow the installation instructions for your host OS: https://docs.conda.io/projects/miniconda/en/latest/

2. Creating the virtual environment using Python 3.10.12

   ```shell
   conda create --name mlzoomcamp-midterm python=3.10.12
   ```

3. Activating the virtual environment

   ```shell
   conda activate mlzoomcamp-midterm
   ```

   The command prompt should now indicate that the virtual environment is activated by showing its name in parentheses: `(mlzoomcamp-midterm)`.

4. Within the activated virtual environment `(mlzoomcamp-midterm)` perform the following steps:

   - Install the requirements from `requirements-eda.txt`:

     ```shell
     pip install -r requirements-eda.txt
     ```

   - Start JupyterLab to check that the installation was successful:

     ```shell
     jupyter lab
     ```
5. Additional information:

   The commands above worked in WSL2 (Windows Subsystem for Linux) on Windows 11 and should be the same on Linux. The `conda` version installed on my system is `23.0.9` (Conda command reference `23.9.x`). In case you are using a different `conda` version and the `conda` commands do not work on your system, check the `conda` cheat sheet of your installed version for the correct commands.
Information on `notebook.ipynb`.

The previously created virtual environment `(mlzoomcamp-midterm)` has JupyterLab installed. In order to start JupyterLab, the virtual environment needs to be activated first. Activate the virtual environment that we created in the section 🛠️ Virtual Environment Setup.

```shell
# navigate to the project directory; the location and command might differ on your system
cd CDC-Diabetes-Health-Indicators
# activate the virtual environment 'mlzoomcamp-midterm'
conda activate mlzoomcamp-midterm
# within the activated environment, indicated by '(mlzoomcamp-midterm)', start JupyterLab
jupyter lab
```
💡 Insights and Results from the EDA (exploratory data analysis) and the 🧠 Model Training.

💡 Information revealed during the EDA from the dataset and its metadata.
The following data was retrieved after downloading the dataset in the EDA notebook `notebook.ipynb` using the `ucimlrepo` Python package and storing the 'variables' information to `dataset/variables.csv`. In order not to duplicate data in multiple files, the following data has been stored here and not in the EDA notebook.
| ID variable | Type | Description |
|---|---|---|
| ID | Integer | Patient ID |

| Target variable | Type | Description |
|---|---|---|
| Diabetes_binary | Binary | 0 = no diabetes, 1 = prediabetes or diabetes |
The features are sorted in the table below by their data type:

- Integer
- Categorical
- Binary

Information for binary features (except for feature `Sex`): `0` = no, `1` = yes
| Feature | Type | Description |
|---|---|---|
BMI | Integer | Body Mass Index |
MentHlth | Integer | Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? scale 1-30 days |
PhysHlth | Integer | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? scale 1-30 days |
GenHlth | Integer (Categorical) | Would you say that in general your health is: scale 1-5 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor |
Age | Integer (Categorical) | Age, 13-level age category (_AGEG5YR see codebook) 1 = 18-24, 9 = 60-64, 13 = 80 or older |
Education | Integer (Categorical) | Education level (EDUCA see codebook) scale 1-6 1 = Never attended school or only kindergarten 2 = Grades 1 through 8 (Elementary) 3 = Grades 9 through 11 (Some high school) 4 = Grade 12 or GED (High school graduate) 5 = College 1 year to 3 years (Some college or technical school) 6 = College 4 years or more (College graduate) |
Income | Integer (Categorical) | Income scale (INCOME2 see codebook) scale 1-8 1 = less than $10,000 5 = less than $35,000 8 = $75,000 or more |
Sex | Binary | Sex, 0 = female 1 = male |
HighBP | Binary | High blood pressure |
HighChol | Binary | High cholesterol |
CholCheck | Binary | Cholesterol check in 5 years |
Smoker | Binary | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] |
Stroke | Binary | (Ever told) you had a stroke. |
HeartDiseaseorAttack | Binary | Coronary heart disease (CHD) or myocardial infarction (MI) |
PhysActivity | Binary | Physical activity in past 30 days - not including job |
Fruits | Binary | Consume Fruit 1 or more times per day |
Veggies | Binary | Consume Vegetables 1 or more times per day |
HvyAlcoholConsump | Binary | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) |
AnyHealthcare | Binary | Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. |
NoDocbcCost | Binary | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? |
DiffWalk | Binary | Do you have serious difficulty walking or climbing stairs? |
💡 Information revealed about the dataset during the EDA regarding:
- Missing values
  - ✅ As stated in the dataset information, the dataset has no missing values.
- Duplicates
  - ✅ There are duplicate rows in the dataset when the patient ID is not taken into account. This is because the feature variables are categorical, binary, and integer: the integer values either have value ranges between 1 and 30 or are discrete values although the original values were floating point (feature `BMI`). Therefore these rows represent different patients who simply share the same feature values.
- Imbalances
  - ✅ The dataset is highly imbalanced with respect to the target variable:
    - 14% (pre-)diabetic
    - 86% non-diabetic
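The class fractions above can be recomputed from the target column alone; a minimal pure-Python sketch (the label list is a toy stand-in for `Diabetes_binary`):

```python
from collections import Counter

def class_fractions(labels):
    """Return the fraction of each class label, e.g. {0: 0.86, 1: 0.14}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# toy example: 86 negatives, 14 positives
labels = [0] * 86 + [1] * 14
print(class_fractions(labels))  # -> {0: 0.86, 1: 0.14}
```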
The BMI (body mass index) is calculated using the following formula. The result is a floating-point number, but in the dataset the BMI is stored as an integer, i.e. rounded to a whole number.

BMI = weight [kg] / (height [m])²
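A small sketch of that calculation (assuming "rounded" means rounded to the nearest integer):

```python
def bmi(weight_kg: float, height_m: float) -> int:
    """Body Mass Index = weight [kg] / height [m]^2,
    stored as an integer as in the dataset."""
    return round(weight_kg / height_m ** 2)

print(bmi(70, 1.75))  # -> 23 (from 22.857...)
```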
🤖 The code for training the final model was exported to the Python script `train.py`. The script covers the following tasks:
- Loading the dataset splits: train, validation, and test
- Creating a test dataset consisting of the test split
- Creating a 'full training' dataset consisting of training split and validation split
- Training the model on the 'full training' (train + validation) dataset
- Evaluating the model on the test dataset
- Printing the metrics to the command line
- Saving the following data to files (bin and json)
- Model
- DictVectorizer (fitted on 'full training' dataset)
- Normalization values (determined on 'full training' dataset in order to normalize the value ranges of some feature variables)
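The saving/loading of these artifacts might look roughly like the sketch below. This is not the actual `train.py`; the file names and the min/max structure of the normalization values are assumptions for illustration:

```python
import json
import pickle

def save_artifacts(model, dict_vectorizer, norm_values,
                   model_path="model.bin", norm_path="normalization.json"):
    """Persist the model and DictVectorizer (pickled together) and the
    normalization values (JSON) determined on the 'full training' set."""
    with open(model_path, "wb") as f:
        pickle.dump((dict_vectorizer, model), f)
    with open(norm_path, "w") as f:
        json.dump(norm_values, f, indent=2)

def load_artifacts(model_path="model.bin", norm_path="normalization.json"):
    """Reload the artifacts for use in the prediction service."""
    with open(model_path, "rb") as f:
        dict_vectorizer, model = pickle.load(f)
    with open(norm_path) as f:
        norm_values = json.load(f)
    return model, dict_vectorizer, norm_values
```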
For running the `train.py` script, make sure the development environment defined in section 🛠️ Virtual Environment Setup is activated before running the following commands.

```shell
# activate the development environment
conda activate mlzoomcamp-midterm
# ▶️ execute the training script
python train.py
```
🎲 We will now randomly sample an entry from the test dataset for testing the model later on. For this purpose the script `sample_from_test.py` is used. It randomly samples one test dataset entry (row) and stores it as a JSON file, `test_sample.json`.
```shell
# activate the development environment, if not already done
conda activate mlzoomcamp-midterm
# 🎲 sample randomly without a seed
python sample_from_test.py
# 🎲 sample randomly using a specific seed
python sample_from_test.py --seed 1234
```
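Conceptually, the sampling works like the sketch below (not the actual `sample_from_test.py`; the rows are illustrative):

```python
import json
import random

def sample_row(rows: list, seed=None) -> dict:
    """Pick one row at random; a fixed seed makes the pick reproducible."""
    rng = random.Random(seed)
    return rng.choice(rows)

# toy rows standing in for the test dataset
rows = [{"BMI": 28, "HighBP": 1}, {"BMI": 22, "HighBP": 0}]
sample = sample_row(rows, seed=1234)

# store the sampled row like the script does
with open("test_sample.json", "w") as f:
    json.dump(sample, f)
```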
This `test_sample.json` will be used when testing the model during the next step, the Model Deployment.
🧩 For deployment, create a new virtual environment for testing the deployment.

1. Create the environment using Python 3.10.12

   ```shell
   conda create --name deployment-midterm python=3.10.12
   ```

2. Activate the virtual environment

   ```shell
   conda activate deployment-midterm
   ```

3. Within the activated virtual environment `(deployment-midterm)` install the requirements from `requirements-deployment.txt`

   ```shell
   pip install -r requirements-deployment.txt
   ```
Testing the deployment by starting the predict service requires two terminal windows.

1. Terminal window #1: Run the predict service

   ```shell
   # activate 'deployment-midterm', if not already activated
   conda activate deployment-midterm
   # start the predict service
   python predict.py
   ```

2. Terminal window #2: Execute the HTTP request using `test_predict.py`, which will use the sample from `test_sample.json`

   ```shell
   # activate 'deployment-midterm', if not already activated
   conda activate deployment-midterm
   # test the predict service using the sample from 'test_sample.json'
   python test_predict.py
   ```
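The HTTP request that `test_predict.py` sends can be sketched with the standard library alone; the `/predict` endpoint path and port are assumptions about the service:

```python
import json
import urllib.request

def build_request(url: str, sample: dict) -> urllib.request.Request:
    """Build the POST request carrying one patient record as JSON."""
    return urllib.request.Request(
        url,
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(url: str, sample: dict) -> dict:
    """Send the request and decode the service's JSON response."""
    with urllib.request.urlopen(build_request(url, sample)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# usage, with the predict service from terminal window #1 running:
#   sample = json.load(open("test_sample.json"))
#   print(predict("http://localhost:9696/predict", sample))
```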
Putting the prediction service in a Docker container requires Docker to be installed on your system.
🛠️ Create a `Pipfile` and `Pipfile.lock` for containerization using `pipenv`

1. Install `pipenv`

   ```shell
   pip install pipenv==2023.10.24
   ```

2. Create a `Pipfile` and `Pipfile.lock` based on the provided `requirements-deployment.txt`

   ```shell
   pipenv install -r requirements-deployment.txt
   ```
The Docker image `ai2ys/mlzoomcamp-midterm-project:0.0.0` has been pushed to the DockerHub registry. Therefore you can run the container without building it first; running the container will pull the image from DockerHub.

```shell
docker pull ai2ys/mlzoomcamp-midterm-project:0.0.0
```
1. Optional: Building the Docker image `ai2ys/mlzoomcamp-midterm-project:0.0.0`

   ```shell
   docker build -t ai2ys/mlzoomcamp-midterm-project:0.0.0 .
   ```

2. Running the Docker container (terminal window #1)

   ```shell
   docker run --rm -p 9696:9696 ai2ys/mlzoomcamp-midterm-project:0.0.0
   ```

3. Testing the prediction service in the Docker container from the virtual environment `(deployment-midterm)`.

   Open a new terminal window and execute the following commands (terminal window #2):

   ```shell
   # activate the virtual environment
   conda activate deployment-midterm
   # run the test script
   python test_predict.py
   ```
Instructions for the cloud deployment using AWS Elastic Beanstalk.

1. Amazon AWS Account

   For this task an AWS account is required. Please create an AWS account following the instructions from Machine Learning Bookcamp - Creating an AWS Account.

2. Installing the EB CLI (Elastic Beanstalk command line interface)

   ```shell
   # create a virtual environment for AWS Elastic Beanstalk
   conda create --name awsebcli python=3.10.12
   # activate the virtual environment
   conda activate awsebcli
   # install the AWS Elastic Beanstalk CLI into the active environment
   pip install awsebcli==3.20.10
   # check the AWS Elastic Beanstalk CLI version
   eb --version
   ```
Video of the cloud deployment showing all steps below: https://youtu.be/eu-TP17kvwc
1. Initialize the AWS Elastic Beanstalk project

   ```shell
   # activate the virtual environment
   conda activate awsebcli
   # initialize eb: select region, specify credentials
   eb init -p "Docker running on 64bit Amazon Linux 2023" -r eu-west-1 --profile <profile> mlzoomcamp-midterm-project
   ```

2. Test locally using Elastic Beanstalk

   - Terminal window #1: Use AWS Elastic Beanstalk to run the service locally

     ```shell
     # activate the virtual environment
     conda activate awsebcli
     eb local run --port 9696
     ```

   - Terminal window #2: Run the `test_predict.py` script

     ```shell
     # activate the virtual environment 'deployment-midterm'
     conda activate deployment-midterm
     python test_predict.py
     ```

3. Test the cloud deployment using Elastic Beanstalk

   - Terminal window #1: Create the Elastic Beanstalk environment

     ```shell
     # activate the virtual environment 'awsebcli'
     conda activate awsebcli
     eb create mlzoomcamp-midterm-env
     ```

     When the service is running, copy its URL to the clipboard.

   - Terminal window #2: Run the `test_predict.py` script and pass the URL

     ```shell
     # activate the virtual environment 'deployment-midterm'
     conda activate deployment-midterm
     python test_predict.py --url <elastic beanstalk url>
     ```

4. When we are done running the prediction service on AWS Elastic Beanstalk, terminate the environment

   ```shell
   # activate the virtual environment, if not already active
   conda activate awsebcli
   eb terminate mlzoomcamp-midterm-env
   ```
General information about the MLZoomCamp Midterm Project can be found here: https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master/projects#midterm-project
Information for cohort 2023 can be found here: https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/projects.md#midterm-project