Binary Prediction of Smoker Status

Problem Description

The goal of this project is to build a machine learning model that predicts a patient's smoking status from various bio-signals. It is part of a Kaggle Playground Series competition. The model outputs the probability that a given patient smokes, and we evaluate it by area under the ROC curve (AUC) using stratified k-fold cross-validation.
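
To make the evaluation scheme concrete, here is a minimal sketch of the two ingredients named above: AUC and stratified folds. It is not the code used in this project. It uses only the Python standard library, and the function names `roc_auc` and `stratified_folds` are illustrative assumptions. The pairwise AUC is quadratic in the number of rows, so it is only suitable for small examples.

```python
import random
from collections import defaultdict


def roc_auc(y_true, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs that preserve class ratios per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for j, i in enumerate(indices):
            folds[j % n_splits].append(i)
    for k in range(n_splits):
        valid = sorted(folds[k])
        train = sorted(i for f, fold in enumerate(folds) if f != k for i in fold)
        yield train, valid


y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y, scores))  # 0.75
```

In practice, scikit-learn's `StratifiedKFold` and `roc_auc_score` provide the same functionality; the per-fold AUC values are then typically averaged into a single validation score.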

Background

The adverse health effects of smoking are well established, and as of 2018 it was a leading cause of preventable morbidity and mortality worldwide. A World Health Organization report forecasts that smoking-related deaths will reach 10 million by 2030. Although evidence-based smoking cessation strategies have been advocated, their success remains limited, and traditional counseling is often considered ineffective and time-consuming. Various factors have been proposed to predict an individual's likelihood of quitting, but applying them yields inconsistent results. Predictive models built with machine learning have emerged in recent years as a promising way to estimate an individual's chances of quitting smoking and to improve public health outcomes.

Data

We combine two datasets for this project.

Kaggle Competition Data: This data was provided by Kaggle for the competition and was synthetically generated by a deep learning model trained on the second dataset. It contains a train set and a test set; the target column smoking is missing from the test set.
Body signal of smoking: This is the data used to train the deep learning model that generated the first dataset. We use it to enlarge the training set, in the hope of improving model performance.
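
As an illustration of how the two sources might be merged, the sketch below concatenates rows from both datasets, keeps only the columns they share, and tags each row with its origin. It is an assumption about one reasonable approach, not the notebook's actual code. The helper `combine_training_rows` and the `is_original` flag are hypothetical names, and rows are plain dictionaries so the example runs without extra dependencies.

```python
def combine_training_rows(competition_rows, original_rows, target="smoking"):
    """Concatenate two lists of row dicts, keeping only their shared columns."""
    shared = set(competition_rows[0]) & set(original_rows[0])
    if target not in shared:
        raise KeyError(f"both datasets must contain the target column {target!r}")
    columns = sorted(shared)
    combined = []
    # Flag 0 marks synthetic competition rows, flag 1 marks original rows.
    for flag, rows in ((0, competition_rows), (1, original_rows)):
        for row in rows:
            record = {col: row[col] for col in columns}
            record["is_original"] = flag  # hypothetical provenance column
            combined.append(record)
    return combined


# Toy rows with made-up values, only to show the shape of the result.
competition = [{"id": 1, "age": 40, "smoking": 0}]
original = [{"age": 55, "gender": "M", "smoking": 1}]
print(combine_training_rows(competition, original))
```

Keeping a provenance flag makes it possible to check later whether the extra rows help, for example by validating only on competition rows.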

Feature Descriptions

Name	Description
Age	Age of patient, grouped by 5-year increments
Height	Height of patient, grouped by 5-cm increments
Weight	Weight of patient, grouped by 5-kg increments
Waist	Waist circumference in cm
Eyesight (left)	Visual acuity in left eye from 0.1 to 2.0 (higher is better), where 1.0 is equivalent to 20/20, blindness is 9.9
Eyesight (right)	Visual acuity in right eye from 0.1 to 2.0 (higher is better), where 1.0 is equivalent to 20/20, blindness is 9.9
Hearing (left)	Hearing in left ear where 1=normal, 2=abnormal
Hearing (right)	Hearing in right ear where 1=normal, 2=abnormal
Systolic	Blood pressure, amount of pressure experienced by the arteries when the heart is contracting
Relaxation	Blood pressure (diastolic), amount of pressure experienced by the arteries when the heart is relaxing
Fasting Blood Sugar	Blood sugar level (concentration per 100ml of blood) before eating
Cholesterol	Sum of ester-type and non-ester-type cholesterol
Triglyceride	Amount of simple and neutral lipids in blood
HDL	High Density Lipoprotein, "good" cholesterol, absorbs cholesterol in the blood and carries it back to the liver
LDL	Low Density Lipoprotein, "bad" cholesterol, makes up most of the body's cholesterol. High levels raise the risk of heart disease and stroke.
Hemoglobin	Protein contained in red blood cells that delivers oxygen to the tissues
Urine Protein	Amount of protein mixed in urine
Serum Creatinine	Creatinine level. Creatinine is a waste product in your blood that comes from your muscles. Healthy kidneys filter creatinine out of your blood through your urine.
AST	Aspartate transaminase, an enzyme that helps the body break down amino acids. It's usually present in blood at low levels. An increase in AST levels may mean liver damage, liver disease or muscle damage.
ALT	Alanine transaminase, an enzyme found in the liver that helps convert proteins into energy for the liver cells. When the liver is damaged, ALT is released into the bloodstream and levels increase.
GTP	Gamma-glutamyl transpeptidase, also known as gamma-glutamyltransferase (GGT), an enzyme in the blood. Higher-than-usual levels may mean liver or bile duct damage.
Dental Caries	Cavities, 0=absent, 1=present
Smoking	0=non-smoker, 1=smoker

Usage

Local Environment Setup

Follow these steps to set up an environment for running the notebook and code locally.

Ensure miniconda/anaconda is installed.
Create the conda environment.

conda env create -f environment.yaml

Activate environment.

conda activate smoker-prediction

Install dependencies with poetry.

poetry install

Docker & Local Testing

Ensure you have Docker installed.
Build the image

docker build -t smoker-prediction .

Run the image

docker run -it --rm -p 9696:9696 smoker-prediction

Test the service locally

python predict-test.py --local

Deployment & Cloud Testing

The model was deployed to AWS Elastic Beanstalk.
URL deployed to: smoking-serving-env.eba-rfk3vyqz.us-west-1.elasticbeanstalk.com

Caution

Service is no longer running

To test the service running in the cloud, run:

python predict-test.py
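
For readers who want to write their own client, the sketch below shows how a request to either target could be built with the standard library. It does not reproduce predict-test.py. The `/predict` path, the plain `http` scheme, the JSON payload shape, and the helper name `build_request` are all assumptions; only the port and the deployed hostname come from this README.

```python
import json
from urllib import request

LOCAL_HOST = "localhost:9696"
CLOUD_HOST = "smoking-serving-env.eba-rfk3vyqz.us-west-1.elasticbeanstalk.com"


def build_request(patient, local=False, path="/predict"):
    """Build (but do not send) a JSON POST request for one patient record."""
    host = LOCAL_HOST if local else CLOUD_HOST
    body = json.dumps(patient).encode("utf-8")
    return request.Request(
        f"http://{host}{path}",  # scheme and path are assumptions
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


patient = {"age": 40, "hemoglobin": 15.2}  # hypothetical subset of features
req = build_request(patient, local=True)
print(req.full_url)  # http://localhost:9696/predict
```

Sending the request is then a matter of calling `request.urlopen(req)` while the container is running. Requests to the cloud host will fail now that the service has been shut down.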

Deploying the model yourself

If you want to deploy the model with Elastic Beanstalk yourself, follow the steps below.

Create an AWS IAM user to use for this project.
Set up access keys.
Initialize the Elastic Beanstalk project.

eb init -p "Docker running on 64bit Amazon Linux 2023" smoking-serving -r <your-region>

Test locally

eb local run --port 9696

Deploy to cloud

eb create smoking-serving-env -i t3.small --timeout 10

Future Work

The original dataset cites the Korean Government as its source, and it appears that additional data can be downloaded from that source. Since the competition dataset was synthetically generated, would additional real-world data improve performance on the competition test set?

Koda98/smoker-status-prediction