Udacity-Pima-Indians-Diabetes-Dataset

Data Visualization Project


Communicate Data Findings

by Chukwunonso Emmanuel Chukwumaeze


Pima Indians Diabetes Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases; the version used here was downloaded from Kaggle.

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Several constraints were placed on the selection of these instances from a larger database. In particular:

  • All patients are female.
  • All patients are at least 21 years old and of Pima Indian heritage.

The dataset contains 768 rows and 9 columns. My analysis uses all rows and all columns, which are:

  • pregnancies : Number of times pregnant
  • glucose : Plasma glucose concentration in a 2-hour oral glucose tolerance test (mg/dL)
  • blood_pressure : Diastolic blood pressure (mmHg)
  • skin_thickness : Triceps skin fold thickness (mm)
  • insulin : 2-hour serum insulin (mu U/ml)
  • bmi : Body Mass Index (weight in kg/(height in m)^2)
  • diabetes_pedigree_function : A function that scores the likelihood of diabetes based on family history
  • age : Age in years; every patient in this dataset is at least 21 years old
  • diabetes : Whether the individual has diabetes (0 = No, 1 = Yes)
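
The snake_case names above differ from the headers in the CSV as distributed on Kaggle. A minimal sketch of the renaming step follows; the original header names and the helper `rename_columns` are assumptions for illustration, not necessarily what the notebook does.

```python
# Mapping from the column headers assumed to be in the Kaggle CSV to the
# snake_case names used in this README. Adjust the keys if your copy differs.
COLUMN_MAP = {
    "Pregnancies": "pregnancies",
    "Glucose": "glucose",
    "BloodPressure": "blood_pressure",
    "SkinThickness": "skin_thickness",
    "Insulin": "insulin",
    "BMI": "bmi",
    "DiabetesPedigreeFunction": "diabetes_pedigree_function",
    "Age": "age",
    "Outcome": "diabetes",
}


def rename_columns(header):
    """Return the header with known columns renamed; unknown names pass through."""
    return [COLUMN_MAP.get(name, name) for name in header]
```

If the data is loaded with pandas, the same mapping can be passed directly as `df.rename(columns=COLUMN_MAP)`.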

Summary of Findings


In the exploration, I found that the variables showing a significant difference with respect to the diabetes outcome were insulin, glucose, BMI, and diastolic blood pressure. I also found that the number of pregnancies did not correlate with BMI or skin thickness, while BMI and skin thickness were positively correlated.
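
Pairwise relationships such as the one between BMI and skin thickness can be checked with a correlation coefficient. The sketch below assumes Pearson's r, which may not be the measure used in the notebook; `pearson_r` is a hypothetical helper that takes two equal-length lists of numbers.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 indicates a positive linear relationship, a value near 0 indicates no linear relationship, and a value near -1 indicates a negative one. With pandas, `df["bmi"].corr(df["skin_thickness"])` computes the same quantity.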

I engineered a new categorical column that groups BMI into health-significant categories according to the CDC guidelines.
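
As an illustration, the grouping could look like the sketch below. The cut-offs follow the CDC adult BMI categories; the function name `bmi_category` and the exact label strings are assumptions and may differ from those used in the notebook.

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to a CDC adult weight-status category."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Healthy weight"
    if bmi < 30:
        return "Overweight"
    return "Obesity"
```

With pandas, the new column could then be created with `df["bmi"].apply(bmi_category)`.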

Upon further investigation, I found no significant relationship between the Diabetes Pedigree Function and the diabetic status of the patients. I also discovered a unique age distribution among diabetic patients.

Key Insights for Presentation


For the presentation I focus on the variables which have the strongest correlation with the diabetes status of a patient.

I start by looking at the distribution of the diabetes status of the patients, then explore the BMI distribution of patients across diabetes status.

I then investigate the influence that age, glucose, and insulin levels have on the diabetes outcome of the patients.

Finally, I investigate the combined effects of age and BMI on diabetes, and the correlation between glucose, insulin, BMI, and the diabetes status of patients.

I made sure to use positional and non-positional encodings where necessary to highlight the interactions between different variables.


How to Navigate:

  • The notebook "Part 1 exploration template" contains the Exploratory Data Analysis.
  • The notebook "Part II slide deck template" contains a slide presentation of my findings, the Explanatory Data Analysis.
  • To view the slide presentation, you need Anaconda installed. In the Anaconda Prompt, go to the folder where you downloaded the "Part_II_slide_deck_template.ipynb" notebook and run the following command (add a leading ! only if you run it from a notebook cell): jupyter nbconvert Part_II_slide_deck_template.ipynb --to slides --post serve

Requirements

The required packages are listed in two files:

  • environment.yaml - for a conda environment. From the root folder of the project, run this command in your Anaconda Prompt: conda env create -f environment.yaml
  • requirements.txt - for plain Python without Anaconda. From the root folder of the project, run this command in your Python environment: pip3 install -r requirements.txt