This is the mini-project for SC1015 Introduction to Data Science and Artificial Intelligence. In this project, we aim to identify undiagnosed individuals at high risk of hypertension using data science and machine learning algorithms.
The dataset is taken from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) Survey Data and Documentation, published by the US Centers for Disease Control and Prevention (CDC).
The documents and source codes are presented in the following order:
- Data Extraction & Cleaning
- Exploratory Data Analysis
- Model 1: Decision Tree
- Model 2: Random Forest
- Model 3: Logistic Regression
- B133 Team 5 Video Slides
- Dataset & Codebook
Practical Motivation & Problem Formulation
Decision Tree with Optimisation
Data Cleaning, Exploratory Data Analysis & Analytic Visualisation
Random Forest with Optimisation & Logistic Regression with Optimisation
In Singapore, hypertension is the most common medical condition, with 35.5% of adults being diagnosed in 2020. Hypertension is also the leading risk factor for cardiovascular disease and death globally, and untreated hypertension can lead to heart disease which can be fatal.
This leads to our problem statement: Which variables are correlated with hypertension, and how can we identify undiagnosed individuals suffering from hypertension?
- How can we identify undiagnosed individuals suffering from hypertension?
- Which machine learning models are most suitable for this task?
Data cleaning and preparation are crucial steps in the data analysis process. These steps involve transforming raw data into a format that is suitable for analysis, and ensuring that the data is accurate, complete, and consistent.
Key steps involved in our data cleaning and preparation process include:
- Extract columns that are relevant to our problem, i.e., hypertension analysis
- Tackle missing values
- Tackle irrelevant data entries
- Create new variables by combining or transforming existing numeric variables
- Decode categorical variables based on data description
- Compute true values for numeric variables
- Identify and remove outliers for numeric variables
- Export cleaned dataset to csv
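The steps above can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the project's actual notebook: the column names (`_RFHYPE6`, `_BMI5`, `_AGE80`, `SEXVAR`), their codes, the scaling factor and the 1.5 x IQR outlier rule are all assumptions chosen for the example, and the real codebook should be checked before reusing any of them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw survey extract. Column names and codes are
# assumptions for illustration only; consult the codebook for the real ones.
raw = pd.DataFrame({
    "_RFHYPE6": [1, 2, 2, 9, 1, 2, 1, 2],   # assumed: 1 = no, 2 = yes, 9 = unknown
    "_BMI5": [2450, 3120, np.nan, 2780, 9999, 2210, 2600, 2900],  # assumed: BMI x 100
    "_AGE80": [34, 61, 47, 55, 29, 72, 40, 66],
    "SEXVAR": [1, 2, 2, 1, 1, 2, 1, 2],     # assumed: 1 = male, 2 = female
    "IDATE": ["01012021"] * 8,              # not relevant to the analysis
})


def clean_survey(raw: pd.DataFrame) -> pd.DataFrame:
    # Extract only the columns relevant to the hypertension analysis.
    df = raw[["_RFHYPE6", "_BMI5", "_AGE80", "SEXVAR"]].copy()

    # Tackle missing values and irrelevant entries (unknown or refused).
    df = df.dropna()
    df = df[df["_RFHYPE6"].isin([1, 2])]

    # Compute true values for numeric variables (undo the implied decimals).
    df["BMI"] = df["_BMI5"] / 100

    # Decode categorical variables based on the data description.
    df["Hypertension"] = df["_RFHYPE6"].map({1: "No", 2: "Yes"})
    df["Sex"] = df["SEXVAR"].map({1: "Male", 2: "Female"})

    # Remove numeric outliers with the 1.5 x IQR rule (one common choice).
    q1, q3 = df["BMI"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["BMI"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    df = df.rename(columns={"_AGE80": "Age"})
    return df[["Hypertension", "Sex", "Age", "BMI"]].reset_index(drop=True)


clean = clean_survey(raw)

# Export the cleaned dataset as CSV. With no path, to_csv returns the CSV
# text; pass a file path instead to write it to disk.
csv_text = clean.to_csv(index=False)
```

Keeping the cleaning in one function makes it easy to rerun the same steps on a fresh extract of the raw file.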
After data cleaning, Exploratory Data Analysis (EDA) and Visualisation are conducted to better understand the variables and obtain initial data-driven insights on their relationships with hypertension.
Key steps involved in our EDA and Visualisation include:
- Uni-variate analysis and visualisation for both numeric and categorical variables
- Bi-variate analysis and visualisation via boxplots to identify promising numeric predictors
- Bi-variate analysis and visualisation via catplots to identify promising categorical predictors
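As a rough companion to the plots, the same comparisons can be made numerically with pandas. The sketch below uses a small made-up table (the values and the `Arthritis` and `BMI` column names are illustrative assumptions): the grouped `describe()` gives the quartiles a boxplot would draw, and the normalised cross-tabulation gives the proportions a count-style catplot would show.

```python
import pandas as pd

# Made-up cleaned data; values and column names are illustrative only.
df = pd.DataFrame({
    "Hypertension": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "BMI": [31.2, 24.5, 29.0, 22.1, 33.4, 26.0, 28.7, 23.9],
    "Arthritis": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes"],
})

# Uni-variate: summary statistics for a numeric variable and
# category counts for a categorical variable.
bmi_summary = df["BMI"].describe()
arthritis_counts = df["Arthritis"].value_counts()

# Bi-variate (numeric): quartiles of BMI within each hypertension group,
# i.e. the numbers behind a boxplot of BMI split by Hypertension.
bmi_by_group = df.groupby("Hypertension")["BMI"].describe()

# Bi-variate (categorical): share of each hypertension status within
# each arthritis category (rows sum to 1).
arthritis_rates = pd.crosstab(
    df["Arthritis"], df["Hypertension"], normalize="index"
)

print(bmi_by_group[["25%", "50%", "75%"]])
print(arthritis_rates)
```

A numeric variable whose group medians are clearly separated, or a category whose hypertension share differs strongly from the others, is a candidate predictor.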
The models were chosen by considering the large data set and the high number of categorical variables. The following models were used:
- Decision Tree
  - Constructing a model of decisions and their possible consequences
  - Optimising the tree depth using the AUC of the ROC curve
- Random Forest
  - Combining multiple decision trees, each trained on a different sample of the training set
  - Optimising the tree depth using the AUC of the ROC curve
- Logistic Regression
  - Predicting the outcome of a categorical variable from multiple independent variables
  - Optimising the model through its hyperparameters
  - Identifying the best hyperparameters through GridSearch
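The tuning approach described above can be sketched with scikit-learn. This is a hedged illustration on synthetic data, not the project's notebooks: the depth range, the `C` grid, the 50-tree forest and the use of a separate validation split are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the cleaned survey.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def best_depth_by_auc(make_model, depths):
    """Fit one model per depth and return the depth with the highest
    validation AUC, together with all the scores."""
    scores = {}
    for depth in depths:
        model = make_model(depth).fit(X_train, y_train)
        proba = model.predict_proba(X_val)[:, 1]
        scores[depth] = roc_auc_score(y_val, proba)
    return max(scores, key=scores.get), scores


depths = range(1, 9)  # assumed search range

# Decision tree and random forest: pick max_depth by AUC of the ROC curve.
tree_depth, tree_auc = best_depth_by_auc(
    lambda d: DecisionTreeClassifier(max_depth=d, random_state=0), depths
)
forest_depth, forest_auc = best_depth_by_auc(
    lambda d: RandomForestClassifier(
        n_estimators=50, max_depth=d, random_state=0
    ),
    depths,
)

# Logistic regression: grid search over the regularisation strength C
# with 5-fold cross-validation on the training set.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)
logreg_accuracy = grid.score(X_val, y_val)

print("tree depth:", tree_depth, "forest depth:", forest_depth)
print("best C:", grid.best_params_["C"])
```

Choosing the depth on a validation split rather than on the final test set keeps the reported test metrics from being optimistically biased.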
- Logistic Regression yielded the best results with the highest prediction accuracy of 0.73 after optimisation.
- A correlation between arthritis and hypertension was found, although we cannot explain the link with current scientific knowledge.
- A higher proportion of Black respondents have hypertension than any other racial group in the dataset. This suggests that genetic factors may contribute to hypertension.
- People with and without hypertension have the same average intake of fruits and vegetables. This suggests that other factors contribute more significantly to hypertension.
- Cholesterol, BMI and age are the three most significant contributors to hypertension across our three models.
https://www.cdc.gov/brfss/annual_data/annual_2021.html
To download the raw data file, please access: https://www.dropbox.com/s/veea7xd97rc7kdv/raw.csv?dl=0
- https://scikit-learn.org/stable/modules/tree.html
- https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/#:~:text=What%20is%20a%20random%20forest,consists%20of%20many%20decision%20trees
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
- https://www.kaggle.com/code/funxexcel/p2-logistic-regression-hyperparameter-tuning/notebook
- https://www.moh.gov.sg/docs/librariesprovider5/default-document-library/nphs-2020-survey-report.pdf
- https://my.clevelandclinic.org/health/articles/11918-cholesterol-high-cholesterol-diseases#:~:text=Cholesterol%20plaque%20and%20calcium%20cause,biggest%20causes%20of%20heart%20disease
- https://www.cdc.gov/bloodpressure/about.htm#:~:text=The%20higher%20your%20blood%20pressure,%2C%20heart%20attack%2C%20and%20stroke
- https://www.medicalnewstoday.com/articles/hypertension-and-asthma#:~:text=People%20with%20asthma%20are%20more,developing%20hypertension%20and%20heart%20disease
- https://www.amjmed.com/article/S0002-9343(09)00208-3/fulltext#:~:text=Arthritis%20pain%20often%20occurs%20concurrently,to%20treat%20pain%20and%20inflammation
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4108512/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4703088/