NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases) research creates knowledge about and treatments for diseases that are among the most chronic, costly, and consequential. The dataset used in this project originally comes from NIDDK. It consists of several medical predictor variables and one target variable (Outcome). The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and more.
Pregnancies -> Number of times pregnant
Glucose -> Plasma glucose concentration in an oral glucose tolerance test
BloodPressure -> Diastolic blood pressure (mm Hg)
SkinThickness -> Triceps skinfold thickness (mm)
Insulin -> Two-hour serum insulin
BMI -> Body Mass Index
DiabetesPedigreeFunction -> Diabetes pedigree function
Age -> Age in years
Outcome -> Class variable (either 0 or 1). 268 of the 768 values are 1, and the remaining 500 are 0
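To make the class balance stated for Outcome concrete, the short sketch below works out the shares from the two counts given above (268 positives out of 768 rows). The variable names are illustrative only and are not part of the project.

```python
# Counts taken from the Outcome description above.
total = 768
positive = 268               # rows with Outcome == 1
negative = total - positive  # rows with Outcome == 0

positive_share = positive / total
negative_share = negative / total
imbalance_ratio = negative / positive  # negatives per positive

print(f"Outcome=1: {positive} ({positive_share:.1%})")          # Outcome=1: 268 (34.9%)
print(f"Outcome=0: {negative} ({negative_share:.1%})")          # Outcome=0: 500 (65.1%)
print(f"Negatives per positive: {imbalance_ratio:.2f}")         # Negatives per positive: 1.87
```

Roughly one row in three is positive, so the data is moderately imbalanced rather than severely skewed.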
- Descriptive analysis of the variables and their corresponding values. In the columns below, a value of zero does not make sense and therefore indicates a missing value:
• Glucose
• BloodPressure
• SkinThickness
• Insulin
• BMI
- A count (frequency) plot describing the data types and the count of variables.
- Histograms of the variables, with the missing values treated accordingly.
- A count plot of the outcomes by value to show the balance of the data, with a description of the findings and a plan for a future course of action.
- Scatter charts between pairs of variables to understand their relationships.
- Correlation analysis using a heat map.
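The zero-as-missing rule in the list above can be sketched as follows. This is a minimal illustration that uses only the Python standard library. The median imputation is an assumption of this sketch, since the project does not state how the missing values were treated. The helper names (`mark_missing`, `impute_median`) and the sample rows are hypothetical.

```python
from statistics import median

# Columns where a zero is implausible and is read as "missing" (from the list above).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def mark_missing(rows):
    """Return copies of the rows with implausible zeros replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in ZERO_AS_MISSING:
            if new_row[col] == 0:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned


def impute_median(rows):
    """Fill None values with the median of the observed values in each column.

    Median imputation is an assumption of this sketch, not the project's
    documented choice.
    """
    filled = [dict(row) for row in rows]
    for col in ZERO_AS_MISSING:
        observed = [r[col] for r in filled if r[col] is not None]
        if not observed:
            continue  # no observed values to impute from
        fill_value = median(observed)
        for r in filled:
            if r[col] is None:
                r[col] = fill_value
    return filled


# Made-up rows for illustration only (not taken from the dataset).
sample = [
    {"Glucose": 120, "BloodPressure": 70, "SkinThickness": 30, "Insulin": 0, "BMI": 32.0},
    {"Glucose": 0, "BloodPressure": 80, "SkinThickness": 0, "Insulin": 100, "BMI": 28.0},
    {"Glucose": 100, "BloodPressure": 0, "SkinThickness": 20, "Insulin": 200, "BMI": 0},
]

marked = mark_missing(sample)
# Number of missing entries per column, useful for the descriptive analysis.
missing_counts = {c: sum(r[c] is None for r in marked) for c in ZERO_AS_MISSING}
imputed = impute_median(marked)
```

Marking the zeros as missing before plotting keeps them from piling up in a spurious bar at zero in the histograms, and counting them per column shows how much of each variable would be imputed.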
a. Pie chart to describe the diabetic and non-diabetic populations
b. Histogram or frequency chart to analyze the distribution of the data
c. Bins of age values (20-25, 25-30, 30-35, etc.) to analyze different variables across these age brackets using a bubble chart.
d. Heatmap of correlation analysis among the relevant variables
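The age binning in item c can be sketched as below. Because the listed brackets share their endpoints (25 appears in both 20-25 and 25-30), this sketch assumes half-open intervals, so that an age of 25 falls into 25-30. The function names and the sample rows are hypothetical, and the per-bracket count and mean are only one plausible choice of inputs for a bubble chart (bubble size and vertical position, respectively).

```python
from collections import defaultdict


def bracket_lower(age, start=20, width=5):
    """Return the lower bound of the half-open bracket [lower, lower + width)."""
    if age < start:
        raise ValueError(f"age {age} is below the first bracket ({start})")
    return start + ((age - start) // width) * width


def summarize_by_bracket(rows, column, start=20, width=5):
    """Return {bracket label: (count, mean of column)}, ordered by age."""
    groups = defaultdict(list)
    for row in rows:
        groups[bracket_lower(row["Age"], start, width)].append(row[column])
    return {
        f"{lower}-{lower + width}": (len(values), sum(values) / len(values))
        for lower, values in sorted(groups.items())
    }


# Made-up rows for illustration only (not taken from the dataset).
sample = [
    {"Age": 21, "Glucose": 100},
    {"Age": 24, "Glucose": 120},
    {"Age": 25, "Glucose": 90},
    {"Age": 33, "Glucose": 150},
]

summary = summarize_by_bracket(sample, "Glucose")
print(summary)
# {'20-25': (2, 110.0), '25-30': (1, 90.0), '30-35': (1, 150.0)}
```

Grouping by the numeric lower bound rather than by the text label keeps the brackets in age order when they are plotted.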