💕💔 Heart Disease EDA & Prediction 🔮

w/ Logistic Regression using SAS Studio 🖥

📃 Table of Contents:

About Project
Objectives
Dataset Description
- Dataset Summary
- Univariate Analysis
  - Categorical
  - Numerical
- EDA 1
- EDA 2
- EDA 3
- EDA 4
- EDA 5
Dataset Pre-processing
Logistic Regression
Output Delivery System (ODS)

🖋 About Project:

👉 This dataset contains diagnoses of heart disease patients. A machine learning model is needed to determine whether or not a person has heart disease.

📌 Objectives:

Perform dataset exploration using various types of visualizations.
Perform EDA on the given dataset.
Build a logistic regression model to predict heart disease status.

🧾 Dataset Description:

👉 There are 14 variables in this dataset:

9 categorical variables, and
5 continuous variables.

👉 The structure of the dataset:

Variable Name	Description	Sample Data
Age	Patient Age (in years)	63; 37; ...
Sex	Gender of patient (1 = male; 0 = female)	1; 0; ...
cp	Chest pain type (4 values: 0, 1, 2, 3)	3; 1; 2; ...
trestbps	Resting blood pressure (in mm Hg)	145; 130; ...
chol	Serum cholesterol (in mg/dl)	233; 250; ...
fbs	Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)	1; 0; ...
restecg	Resting electrocardiographic results (values 0, 1, 2)	0; 1; ...
thalach	Maximum heart rate achieved	150; 187; ...
exang	Exercise induced angina (1 = yes; 0 = no)	1; 0; ...
oldpeak	ST depression induced by exercise relative to rest	2.3; 3.5; ...
slope	The slope of the peak exercise ST segment (values 0, 1, 2)	0; 2; ...
ca	Number of major vessels (0-4) colored by fluoroscopy	0; 3; ...
thal	(3 = normal; 6 = fixed defect; 7 = reversible defect)	1; 3; ...
Target	Target column (1 = Yes; 0 = No)	1; 0; ...

📊 EDA:

🏛 Dataset Summary:

As mentioned above, there are 14 variables with 303 observations.

🔍 Univariate Analysis:

▶ Univariate - Categorical:

sex (Gender)
- There are more male patients than female patients.
cp (Chest Pain Type)
- Chest pain type 0 is the most common type of chest pain.
fbs (Fasting Blood Sugar)
- Most patients have a fasting blood sugar that is not above 120 mg/dl.
restecg (Resting Electrocardiographic Results)
- Resting electrocardiographic results 0 and 1 are far more frequent than result 2.
- In addition, result 1 is the most frequent of the three.
exang (Exercise Induced Angina)
- Patients without exercise-induced angina outnumber patients with exercise-induced angina.
slope (Slope of the Peak Exercise)
- Slopes 1 and 2 occur almost equally often.
- Of the two, slope 2 is the most frequent overall.
ca (Number of Major Vessels)
- Patients with 0 major vessels form the largest group.
thal
- A "thal" value of 2 is the most frequent.
target (Heart Disease Status)
- There are more patients with heart disease than patients without heart disease.

▶ Univariate - Numerical:

age (Patient Age)
- From the histogram and boxplot, this column appears approximately normally distributed, which is consistent with its skewness value (-0.2) being close to zero.
- The kurtosis value is -0.5, which indicates that the column is platykurtic.
- In the Q-Q plot, the data values closely follow the 45-degree line, which also suggests that the data is approximately normally distributed.
trestbps (Resting Blood Pressure in mm Hg)
- From the histogram, it can be seen that this column is moderately right skewed. This is supported by the skewness value (0.7) of this column.
- There are some outliers detected at the upper part of the boxplot.
- At the upper part of the Q-Q plot, the data values move away from the 45-degree line, which is consistent with the moderate right skew stated previously.
- The kurtosis value is 0.9. SAS reports excess kurtosis (a normal distribution has 0), so this positive value indicates that the column is leptokurtic.
chol (Serum Cholesterol in mg/dl)
- From the histogram, it can be seen that this column is highly right skewed. This is supported by the skewness value (1.1) of this column.
- There are some outliers detected at the upper part of the boxplot.
- At the upper part of the Q-Q plot, there is a gap from the 45-degree line, which is consistent with the high right skew stated previously.
- The kurtosis value is 4.5, which indicates that the column is leptokurtic.
thalach (Maximum Heart Rate)
- From the histogram, it can be seen that this column is moderately left skewed. This is supported by the skewness value (-0.5) of this column.
- There is an outlier detected at the bottom part of the boxplot.
- At the bottom part of the Q-Q plot, there is a gap from the 45-degree line, which is consistent with the moderate left skew stated previously.
- The kurtosis value is -0.06, which indicates that the column is slightly platykurtic (very close to mesokurtic).
oldpeak
- From the histogram, it can be seen that this column is highly right skewed. This is supported by the skewness value (1.3) of this column.
- There are some outliers detected at the upper part of the boxplot.
- In the Q-Q plot, the data values depart from the 45-degree line, which is consistent with the high right skew stated previously.
- The kurtosis value is 1.57. SAS reports excess kurtosis (a normal distribution has 0), so this positive value indicates that the column is leptokurtic.
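The skewness and kurtosis readings above can be illustrated with a small sketch. The project computes these statistics in SAS; the Python below is only a language-neutral illustration that assumes the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3), so its results can differ slightly from the bias-corrected values SAS prints. The function names and the `tol` threshold are hypothetical choices for this example.

```python
def central_moments(values):
    """Return the 2nd, 3rd and 4th central moments (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m2, m3, m4


def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a right tail."""
    m2, m3, _ = central_moments(values)
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: a normal distribution scores 0."""
    m2, _, m4 = central_moments(values)
    return m4 / m2 ** 2 - 3.0


def kurtosis_label(excess, tol=0.1):
    """Classify an excess kurtosis value; tol is an arbitrary band around 0."""
    if excess > tol:
        return "leptokurtic"
    if excess < -tol:
        return "platykurtic"
    return "approximately mesokurtic"


if __name__ == "__main__":
    sample = [1, 1, 1, 2, 10]  # made-up values with a long right tail
    print(skewness(sample) > 0)      # right-skewed sample
    print(kurtosis_label(4.5))       # the excess kurtosis reported for chol
```

Because a raw (non-excess) kurtosis can never be negative, values such as -0.5 can only be read on the excess scale, where the sign decides between platykurtic and leptokurtic.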

1️⃣ EDA 1:

2️⃣ EDA 2:

3️⃣ EDA 3:

4️⃣ EDA 4:

5️⃣ EDA 5:

⚙ Dataset Pre-processing:

During data pre-processing, one-hot encoding is performed on these columns:
- cp (into cp_0, cp_1, cp_2, and cp_3)
- thal (into thal_0, thal_1, thal_2, and thal_3)
- slope (into slope_0, slope_1, and slope_2)
After one-hot encoding, the original columns (cp, thal, and slope) are dropped from the table.
Then, the observations are split into 80% train and 20% test sets using PROC SURVEYSELECT.
Next, the new Selected column is dropped from both the train and test data.
Finally, the target values in the test set are changed to NULL (missing) values.

Each data pre-processing step is available in part no. 3 of the main.sas file.
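The pre-processing steps can be sketched outside SAS as follows. This is not the code in main.sas; it is a minimal Python illustration that assumes each observation is a dictionary, that the training size is rounded up (which gives 243 of 303 observations), and that a fixed seed is acceptable. The helper names `one_hot` and `split_train_test` are hypothetical.

```python
import math
import random


def one_hot(rows, column, levels):
    """Replace `column` with one 0/1 indicator per level, e.g. cp -> cp_0..cp_3."""
    encoded = []
    for row in rows:
        # Copy every field except the original categorical column.
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


def split_train_test(rows, train_rate=0.8, seed=42):
    """Simple random split; test rows get target=None to mimic missing values."""
    n_train = math.ceil(train_rate * len(rows))
    rng = random.Random(seed)
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [dict(r) for i, r in enumerate(rows) if i in train_idx]
    test = [dict(r, target=None) for i, r in enumerate(rows) if i not in train_idx]
    return train, test


if __name__ == "__main__":
    # Synthetic stand-in rows, not the real heart disease data.
    rows = [{"cp": i % 4, "thal": i % 4, "slope": i % 3, "target": i % 2}
            for i in range(303)]
    for column, levels in (("cp", range(4)), ("thal", range(4)), ("slope", range(3))):
        rows = one_hot(rows, column, levels)
    train, test = split_train_test(rows)
    print(len(train), len(test))
```

Blanking the target in the test set means the scoring step cannot accidentally use the true labels when producing predictions.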

👨‍💻 Logistic Regression:

▶ Building Logistic Regression Model:

[Image 1] - The train set contains 243 observations (no missing values detected). In addition, the numbers of patients with and without heart disease are roughly balanced.
[Image 2] - The "Model Convergence Status" is satisfied, which indicates that the maximum likelihood estimation converged, so the fitted coefficients can be interpreted. Convergence alone does not measure predictive quality. The fit statistics also show a smaller AIC value than SC value.
[Image 3] - The p-values in the "Pr > ChiSq" column show that not all variables are significant in the model. A variable's p-value has to be less than 0.05 for it to have a significant effect on the variation in heart disease status. (Examples of significant predictors: sex, cp_0, exang, etc.)
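The fit statistics and p-values referenced above can be reproduced by hand. The sketch below assumes the standard definitions AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n), and a Wald chi-square test with 1 degree of freedom for each coefficient. The function names and all numeric inputs are hypothetical and are not taken from the project's output.

```python
import math


def aic(neg2_log_l, k):
    """Akaike information criterion for k estimated parameters."""
    return neg2_log_l + 2 * k


def sc(neg2_log_l, k, n):
    """Schwarz criterion (BIC): the penalty per parameter grows with ln(n)."""
    return neg2_log_l + k * math.log(n)


def wald_p_value(estimate, std_error):
    """p-value of the 1-df Wald chi-square statistic (estimate / std_error)^2."""
    chi_sq = (estimate / std_error) ** 2
    # Survival function of a chi-square with 1 df, written via erfc.
    return math.erfc(math.sqrt(chi_sq / 2.0))


if __name__ == "__main__":
    neg2_log_l, k, n = 200.0, 5, 243   # made-up fit with 243 training rows
    print(aic(neg2_log_l, k), sc(neg2_log_l, k, n))
    print(wald_p_value(1.2, 0.4) < 0.05)   # made-up coefficient and standard error
```

Since ln(n) exceeds 2 for any n of 8 or more, SC is larger than AIC for every model fitted on 243 observations, so the two criteria are more informative when compared across candidate models than against each other.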

▶ Probability in Training:

▶ Predictions on Test:

📥 Output Delivery System:

Output Delivery System (ODS) is used to present the output of a SAS program as a well-formatted report, which helps the user understand the results of their analysis much more easily. In this case, the predictions are exported as a PDF file (.pdf).
The prediction report can be seen here.

Each step for creating the output (ODS) file is available in part no. 5 of the main.sas file.

🙌 Support me!

👉 If you find this project useful, please ⭐ this repository 😆!

🎈 Check out my work on Kaggle here using various machine learning models!

👉 More about myself: here

caesarmario/heart-disease-prediction-with-logistic-regression-SAS-studio