Diabetes-Prediction

Predicting diabetes with several machine learning techniques, to help doctors detect the presence of diabetes easily and accurately from diabetes-test-related values (attributes).

Primary Language: Jupyter Notebook

Prediction of Diabetes using Machine Learning Algorithms

Diabetes, scientifically termed Diabetes Mellitus, is a chronic condition of high glucose in the bloodstream. Carbohydrate-rich food is broken down into glucose, which is released into the blood. The pancreas produces a hormone called insulin, which moves the glucose from the blood into the cells, where it is used to generate energy. When insulin does not play this role, glucose stays in the blood for longer, which can cause serious health problems. Undiagnosed diabetes can lead to serious and untreatable diseases later in life. The disease can affect people of any age group, so periodic assessment is required. There are broadly three types of diabetes: Type 1, Type 2, and gestational diabetes.

Detection at an early stage is imperative because it can save millions of lives, or at least extend them. In a world of technology and automation, using Artificial Intelligence for predictive analysis can help identify diabetes at the initial screening, and because the process is algorithmic, it reduces the scope for manual error. People of every gender and status can then afford screening and avoid complications. The proposed method predicts early-stage diabetes with an accuracy of up to 81.81%.

Overview

  1. Installation
  2. Dataset
  3. Methodology
  4. Results
  5. Conclusion

Installation

  • Download Diabetes_Prediciton_using_ML.ipynb
  • Download the dataset into your working directory.
  • Run the notebook in Google Colab.

Dataset

The implementation uses the publicly available PIMA dataset, which can be downloaded from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database. The dataset contains 768 instances and 8 attributes that can help detect the presence of diabetes in a person. It is a subset of a larger dataset held by the National Institute of Diabetes and Digestive and Kidney Diseases and covers patients aged 21 to 81 years. Diabetic patients account for 34.9% of the sample and non-diabetic patients for 65.1%. The attributes are Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Age. The result is recorded in the Outcome column, where '0' represents a healthy person and '1' a diabetic person.
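As a quick sanity check after downloading, the figures above can be verified with pandas. This is a minimal sketch and not the notebook's own code: `summarize` is a hypothetical helper, the file name `diabetes.csv` is an assumption about how the Kaggle download is saved, and the column names are those of the Kaggle CSV header.

```python
import pandas as pd

# Column names as they appear in the Kaggle CSV header (8 attributes + label).
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
LABEL = "Outcome"


def summarize(df: pd.DataFrame) -> dict:
    """Return row count, attribute count, and class shares in percent."""
    missing = [c for c in FEATURES + [LABEL] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Share of each label value (0 = healthy, 1 = diabetic) as a percentage.
    shares = df[LABEL].value_counts(normalize=True) * 100
    return {
        "rows": len(df),
        "attributes": len(FEATURES),
        "diabetic_pct": round(float(shares.get(1, 0.0)), 1),
        "non_diabetic_pct": round(float(shares.get(0, 0.0)), 1),
    }


# Usage (assumed file name; adjust to wherever the CSV was saved):
# df = pd.read_csv("diabetes.csv")
# print(summarize(df))
```

If the download is intact, the summary should report 768 rows, 8 attributes, and the class shares quoted above.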

Methodology

The proposed approach first prepares the dataset to feed the model. Preprocessing cleans the dataset, which in turn makes training the model more efficient. Seven different algorithms are then compared to understand which works better and more efficiently with this approach.

  1. Preprocessing

     To separate the important features from the dependent ones, a correlation matrix is plotted. The correlation matrix describes the relationship between the attributes, which helps identify the important features. Its values range from -1 to +1, where a positive value represents a positive relationship and a negative value a negative one. The top features, namely Glucose, BMI, Insulin, and Age, are separated from the rest, and the dataset restricted to those attributes is fed to the model for prediction. Zero values in the dataset are filled with the mean of the respective attribute so that the record count stays constant. The Min-Max Scaler is used to normalize the dataset, and finally the dataset is split in an 8:2 ratio, i.e. 80% for training and 20% for testing and validation of the models.

     Figure: Correlation Matrix

  2. Model Selection

     The approach is implemented with seven machine learning algorithms individually, to compare which algorithms work better than the others and can be developed further to improve accuracy. The algorithms evaluated are:
    • Logistic Regression
    • K Nearest Neighbors
    • Support Vector Machine
    • Naive Bayes
    • Decision Tree
    • Random Forest
    • XGBoost
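The preprocessing and comparison steps above can be sketched with pandas and scikit-learn. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, zeros are replaced by the mean of the non-zero values of each column, the split is stratified, the scaler is fitted on the training part only (the text lists normalization before the split), Gaussian Naive Bayes is assumed as the Naive Bayes variant, and all models use default hyperparameters. XGBoost is included only if the separate `xgboost` package is installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

TOP_FEATURES = ["Glucose", "BMI", "Insulin", "Age"]  # features named in the text


def correlation_with_outcome(df):
    """Pearson correlation of each attribute with Outcome, strongest first."""
    corr = df.corr()["Outcome"].drop("Outcome")
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


def preprocess(df, seed=42):
    """Zero -> mean imputation, 80/20 split, Min-Max scaling."""
    X = df[TOP_FEATURES].astype(float)
    y = df["Outcome"]
    # Assumption: a zero means "not measured"; fill it with the column mean
    # computed over the non-zero values.
    X = X.mask(X == 0)
    X = X.fillna(X.mean())
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Assumption: fit the scaler on the training part only, then apply to both.
    scaler = MinMaxScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


def build_models(seed=42):
    """The algorithms listed above, with default hyperparameters."""
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "K Nearest Neighbors": KNeighborsClassifier(),
        "Support Vector Machine": SVC(),
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
    }
    try:  # XGBoost lives in a separate package; skipped if not installed
        from xgboost import XGBClassifier
        models["XGBoost"] = XGBClassifier(random_state=seed)
    except ImportError:
        pass
    return models


def compare(df, seed=42):
    """Train every model and return its accuracy on the held-out 20%."""
    X_train, X_test, y_train, y_test = preprocess(df, seed)
    scores = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


# Usage (assumed file name):
# df = pd.read_csv("diabetes.csv")
# print(correlation_with_outcome(df))
# print(compare(df))
```

Because of these assumptions and the choice of random seed, the accuracies this sketch produces will not necessarily match the figures reported for the notebook.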

Results and Evaluation

The confusion matrix evaluates the performance of each model by dividing the test samples into four categories: True Positives, True Negatives, False Positives, and False Negatives. The total across all four is ‘n’, the sample size of the test dataset. Each algorithm is evaluated on metrics that include precision, recall, F1-score, and support. The accuracies achieved are displayed in the table below:

Model                     Accuracy
Logistic Regression       79.87%
K Nearest Neighbors       81.81%
Support Vector Machine    80.51%
Naive Bayes               80.51%
Decision Tree             69.48%
Random Forest             77.92%
XGBoost                   78.57%
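The metrics named above follow directly from the four confusion-matrix cells. The sketch below is plain Python rather than the notebook's code; `metrics` and `correct_predictions` are hypothetical helpers, and the cell counts in the usage are made up for illustration and are not results from this project. The second helper checks the table against a test set of n = 154, which is an inference from the 8:2 split of 768 records (rounded up), not a number stated in the text.

```python
import math


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and support for the positive class."""
    n = tp + tn + fp + fn  # sample size of the test set
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "n": n,
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,  # number of truly positive samples
    }


def correct_predictions(accuracy_pct, n):
    """Find k such that k/n, truncated to two decimals in percent, matches.

    Returns None if no whole number of correct predictions fits.
    """
    target = round(accuracy_pct * 100)  # e.g. 81.81 -> 8181
    for k in range(n + 1):
        if math.floor(k / n * 10000) == target:
            return k
    return None


# Hypothetical cell counts for illustration only (not the notebook's results).
example = metrics(tp=30, tn=90, fp=10, fn=24)
print(example["n"], round(example["accuracy"], 4))

# Reported accuracies from the table, checked against the assumed n = 154.
for pct in (79.87, 81.81, 80.51, 69.48, 77.92, 78.57):
    print(pct, correct_predictions(pct, 154))
```

Under that assumption every reported figure corresponds to a whole number of correct predictions, for example 81.81% to 126 of 154, with the percentages truncated rather than rounded.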

Conclusion

Reducing the feature set and substituting zero values with the mean of each attribute improved the training of the models, which helped in achieving better outcomes. Models built on the top four features reached accuracies of around 80%, with K Nearest Neighbors achieving the highest at 81.81%. There is still considerable room to improve the accuracy and precision of the models, which would help people learn early in life whether they are susceptible to diabetes. Diabetes must be detected in its early stages if it is to be treated properly and its side effects curtailed. This project proposes a machine learning method for predicting diabetes. The method may also assist researchers in developing an accurate and useful tool that reaches physicians' desks and helps them make better decisions about a patient's diabetes status.