In the first part of the project, I analyzed the dataset, addressed missing values, and generated charts using the Microsoft Excel Data Analysis tool.
For cholesterol:
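The write-up does not say which method was used to address missing values in Excel. As a rough illustration only, the sketch below assumes median imputation, one common choice for skewed clinical measurements. The function name `impute_median` and the sample values are hypothetical and not taken from the project data.

```python
from statistics import median


def impute_median(values):
    """Return a copy of `values` with None entries replaced by the median
    of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing readings (None marks a missing cell).
column = [210, None, 185, 240, None, 199]
filled = impute_median(column)
print(filled)
```

The median is less sensitive to extreme readings than the mean, which is why it is assumed here; a mean or model-based fill would follow the same pattern.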
Correlation Matrix (Pearson):
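For readers without the Excel tool at hand, each entry of a Pearson correlation matrix can be reproduced with a few lines of code. This is a minimal sketch of the standard formula, not the Excel implementation; the column names and values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Build a nested dict {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values, for illustration only.
data = {
    "age": [29, 41, 35, 52, 47],
    "cholesterol": [180, 220, 205, 250, 239],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

The diagonal is always 1 and the matrix is symmetric, so only the upper or lower triangle carries information.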
Normal Probability Plot after handling missing values:
In the second part, the objective is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset, using IBM SPSS Modeler. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Pima Indians Diabetes Database can be downloaded from here.
The schema of the project in IBM SPSS Modeler:
Test scores before performing anomaly detection algorithms:
Test scores after performing anomaly detection algorithms:
I tried several models, including C5.0, CART, Random Forest, QUEST, and Regression. The best test score came from the CHAID (Chi-squared Automatic Interaction Detection) model:
The final schema of the project:
The hyperparameters of the model:
- The top 3 decisive features in predicting the target were Glucose, Age, and BMI.
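CHAID chooses its splits with chi-squared tests of independence between each (binned) predictor and the target. The sketch below computes only that core statistic for a contingency table and ranks candidate predictors by it. It is not the SPSS Modeler implementation, which also merges categories and applies adjusted p-values. The counts and the helper name `chi_squared` are hypothetical and are not results from this project.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a
    list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2x2 tables: rows are low/high bins of the predictor,
# columns are Outcome = 0 / Outcome = 1. Counts are invented.
candidates = {
    "Glucose": [[40, 10], [15, 35]],
    "Age": [[30, 20], [25, 25]],
}
scores = {name: chi_squared(table) for name, table in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because both tables have the same shape, a larger statistic corresponds to a smaller p-value, so ranking by the statistic is enough in this toy case. Tables of different shapes would need to be compared by p-value instead.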