Hypertension Classifier using Machine Learning and Optimisation Techniques

About

This is the mini-project for SC1015 Introduction to Data Science and Artificial Intelligence. In this project, we aim to identify undiagnosed individuals at high risk of hypertension using data science and machine learning algorithms.

The dataset is taken from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) Survey Data and Documentation, conducted by the US Centers for Disease Control and Prevention (CDC).

The documents and source code are presented in the following order:

Contributors

Team:

Practical Motivation & Problem Formulation

Li Liyi:

Decision Tree with Optimisation

Wu Rixin:

Data Cleaning, Exploratory Data Analysis & Analytic Visualisation

Yong Shao En Ernest:

Random Forest with Optimisation & Logistic Regression with Optimisation

Problem Definition

In Singapore, hypertension is the most common medical condition, with 35.5% of adults diagnosed in 2020. Hypertension is also the leading risk factor for cardiovascular disease and death globally, and untreated hypertension can lead to heart disease, which can be fatal.

This leads to our problem statement: What are the variables correlated with hypertension and how can we identify undiagnosed individuals suffering from hypertension?

How can we identify undiagnosed individuals suffering from hypertension?
Which machine learning models are most suitable for this task?

Data Cleaning and Preparation

Data cleaning and preparation are crucial steps in the data analysis process. These steps involve transforming raw data into a format that is suitable for analysis, and ensuring that the data is accurate, complete, and consistent.

Key steps involved in our data cleaning and preparation process include:

Extract columns that are relevant to our problem, i.e., hypertension analysis
Tackle missing values
Tackle irrelevant data entries
Create new variables by combining or transforming existing numeric variables
Decode categorical variables based on data description
Compute true values for numeric variables
Identify and remove outliers for numeric variables
Export cleaned dataset to csv
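
The steps above can be sketched on a toy table as follows. This is a minimal illustration, not the project's actual notebook: the column names, the response codes (1 = yes, 2 = no, 9 = refused), the implied-decimal weight field and the 1.5 × IQR outlier rule are all assumptions made for the example, and the real BRFSS variables differ.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw survey extract. Column names and codes are
# hypothetical and do not match the real BRFSS variable names.
raw = pd.DataFrame({
    "hypertension": [1, 2, 1, 9, 2, 2],        # 1 = yes, 2 = no, 9 = refused
    "sex": [1, 2, 2, 1, 2, 1],                 # 1 = male, 2 = female
    "weight_kg_x100": [7000, 8200, np.nan, 6500, 9000, 30000],  # two implied decimals
    "height_cm": [170, 165, 180, 175, 160, 172],
    "interview_id": [11, 12, 13, 14, 15, 16],  # not relevant to the analysis
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Extract only the columns relevant to the problem.
    out = df[["hypertension", "sex", "weight_kg_x100", "height_cm"]].copy()
    # Tackle missing values (here: drop incomplete rows).
    out = out.dropna()
    # Tackle irrelevant entries such as "refused" responses.
    out = out[out["hypertension"].isin([1, 2])]
    # Compute true values for numeric variables stored with implied decimals.
    out["weight_kg"] = out["weight_kg_x100"] / 100
    out = out.drop(columns="weight_kg_x100")
    # Create a new variable from existing numeric variables (BMI).
    out["bmi"] = out["weight_kg"] / (out["height_cm"] / 100) ** 2
    # Decode categorical codes into readable labels.
    out["hypertension"] = out["hypertension"].map({1: "Yes", 2: "No"})
    out["sex"] = out["sex"].map({1: "Male", 2: "Female"})
    # Remove numeric outliers using the 1.5 * IQR rule (an assumed choice).
    q1, q3 = out["bmi"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out = out[out["bmi"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return out.reset_index(drop=True)

cleaned = clean(raw)
# With no path argument, to_csv returns the CSV text; pass a filename to write a file.
csv_text = cleaned.to_csv(index=False)
```

Dropping incomplete rows is the simplest way to handle missing values; imputation would be an alternative where too many rows would otherwise be lost.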

Exploratory Data Analysis and Visualisation

After data cleaning, Exploratory Data Analysis (EDA) and Visualisation are conducted to better understand the variables and obtain initial data-driven insights on their relationships with hypertension.

Key steps involved in our EDA and Visualisation include:

Uni-variate analysis and visualisation for both numeric and categorical variables
Bi-variate analysis and visualisation to identify promising predictor numeric variables via boxplots
Bi-variate analysis and visualisation to identify promising predictor categorical variables via catplots
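
As a rough illustration of the analyses listed above, the sketch below computes the numbers that a boxplot and a category plot would display, using pandas only. The toy data and the column names (`bmi`, `arthritis`, `hypertension`) are assumptions for the example and are not the project's cleaned dataset.

```python
import pandas as pd

# Toy cleaned data; values and column names are illustrative only.
df = pd.DataFrame({
    "hypertension": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "bmi": [31.0, 24.0, 29.5, 22.5, 33.0, 26.0, 28.0, 23.5],
    "arthritis": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes"],
})

# Uni-variate: distribution of one numeric and one categorical variable.
numeric_summary = df["bmi"].describe()
category_counts = df["arthritis"].value_counts()

# Bi-variate, numeric predictor: quartiles of BMI per hypertension group,
# i.e. the statistics a boxplot of bmi against hypertension would draw.
bmi_by_group = df.groupby("hypertension")["bmi"].describe()

# Bi-variate, categorical predictor: share of each hypertension class
# within each arthritis category (rows sum to 1).
arthritis_vs_htn = pd.crosstab(
    df["arthritis"], df["hypertension"], normalize="index"
)
```

A numeric variable looks promising as a predictor when its per-group quartiles are clearly separated, and a categorical variable looks promising when the class shares differ noticeably between its categories.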

Machine Learning Models

The models were chosen by considering the large data set and the high number of categorical variables. The following models were used:

1. Decision Tree

Constructing a tree of decision rules on the predictors, where each leaf gives a predicted class
Optimising the tree depth using the area under the ROC curve (AUC)
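
Depth selection by AUC can be sketched as below. This is a minimal example on synthetic data, assuming scikit-learn, a single hold-out validation split and a depth range of 1 to 10; none of these settings are taken from the project's notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the cleaned survey data.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit one tree per candidate depth and record its validation AUC.
auc_by_depth = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # AUC is computed from predicted probabilities of the positive class.
    scores = tree.predict_proba(X_val)[:, 1]
    auc_by_depth[depth] = roc_auc_score(y_val, scores)

# Keep the depth with the highest validation AUC.
best_depth = max(auc_by_depth, key=auc_by_depth.get)
```

Scoring on held-out data matters here: training AUC keeps rising with depth, so it would always favour the deepest tree.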

2. Random Forest

Training multiple decision trees on different random samples of the training data and combining their predictions
Optimising the tree depth using the area under the ROC curve (AUC)
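
A comparable sketch for the random forest is given below, again on synthetic data. The number of trees (50), the candidate depths and the hold-out split are assumptions for illustration, and `best_depth_by_auc` is a hypothetical helper name rather than a function from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def best_depth_by_auc(X_train, y_train, X_val, y_val, depths):
    """Return (best depth, AUC per depth) for a random forest."""
    aucs = {}
    for depth in depths:
        # Each tree is fitted on a bootstrap sample of the training rows,
        # and the forest averages the trees' predicted probabilities.
        forest = RandomForestClassifier(
            n_estimators=50, max_depth=depth, random_state=0
        )
        forest.fit(X_train, y_train)
        aucs[depth] = roc_auc_score(y_val, forest.predict_proba(X_val)[:, 1])
    return max(aucs, key=aucs.get), aucs

# Synthetic data standing in for the cleaned survey data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

rf_best_depth, rf_aucs = best_depth_by_auc(
    X_train, y_train, X_val, y_val, depths=[2, 4, 6, 8]
)
```

After fitting, a forest's `feature_importances_` attribute gives one way to rank predictors by their contribution.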

3. Logistic Regression

Predicting the class of a categorical outcome from multiple independent variables
Optimising the model through hyperparameter tuning
Identifying the best hyperparameters through GridSearch
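
The grid search can be sketched with scikit-learn's `GridSearchCV`, assuming that is what GridSearch refers to. The parameter grid (`C` and `class_weight`), the feature scaling step and the 5-fold cross-validation are illustrative choices, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the cleaned survey data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features, then fit logistic regression.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate hyperparameters; keys are prefixed with the pipeline step name.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10],
    "logisticregression__class_weight": [None, "balanced"],
}

# Try every combination with 5-fold cross-validation, scored by accuracy.
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
# The best model is refitted on all training data and scored on the test split.
test_accuracy = search.score(X_test, y_test)
```

Keeping the test split out of the search means the reported accuracy is not inflated by the tuning itself.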

Conclusion

Logistic Regression yielded the best results with the highest prediction accuracy of 0.73 after optimisation.
A correlation between arthritis and hypertension was found, although we are not aware of an established scientific explanation for the link.
A higher proportion of Black respondents suffer from hypertension than respondents of any other racial group. This suggests that hypertension may be influenced by genetic factors.
People with and without hypertension have the same average intake of fruits and vegetables. This suggests that there are other more significant factors that contribute to hypertension.
Cholesterol, BMI and age are the three most significant contributors to hypertension across our three models.

References

liliyigz/22S2-SC1015-Data-Science-and-AI