Prediction of Diabetes using Machine Learning Algorithms

Diabetes, scientifically termed Diabetes Mellitus, is a chronic condition marked by high glucose levels in the bloodstream. Carbohydrate-rich food is broken down into glucose, which enters the blood. The pancreas produces a hormone called insulin that moves this sugar from the blood into the cells, where it is used to generate energy. When insulin does not perform this role, glucose stays in the blood for longer, which can cause serious health problems. Undiagnosed diabetes can lead to serious and untreatable diseases later in life. The disease can affect people of any age group, so periodic assessment is required. There are broadly three types of diabetes: Type 1, Type 2, and gestational diabetes. Detection at an early stage is essential because it can save lives or at least extend patients' lifetimes. Using Artificial Intelligence for predictive analysis can help identify the disease at the initial screening, and because the process is algorithmic, the screening is consistent and repeatable. Affordable screening of this kind could help people of every gender and economic status avoid complications. The proposed method predicts diabetes at an early stage with up to 81.81% accuracy.

Overview

Installation
Dataset
Methodology
Results
Conclusion

Installation

Download Diabetes_Prediciton_using_ML.ipynb
Download the dataset in your working directory.
Run the notebook in Google Colab.

Dataset

The dataset used for the implementation is the publicly available PIMA dataset. It can be downloaded from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database. The dataset contains 768 instances and 8 attributes that can help detect the presence of diabetes in a person. It is a subset of a larger dataset held by the National Institute of Diabetes and Digestive and Kidney Diseases and covers patients aged 21 to 81 years. Diabetic patients account for 34.9% of the whole sample, whereas non-diabetic patients account for 65.1%. The attributes are Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, and Age. The result is recorded in the Outcome column, where '0' represents a healthy person and '1' a diabetic person.
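The class balance quoted above can be checked with a few lines of standard-library Python. This is only a sketch: the helper name class_shares is hypothetical, and the column name Outcome is taken from the description above.

```python
import csv
import io
from collections import Counter

def class_shares(csv_file, label_column="Outcome"):
    """Return {label: percentage of rows} for the label column of a CSV file object."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Tiny in-memory sample standing in for the real file (values are illustrative).
sample = io.StringIO("Glucose,Outcome\n148,1\n85,0\n183,1\n89,0\n")
print(class_shares(sample))
```

On the real data the call would look like `with open("diabetes.csv", newline="") as f: class_shares(f)`, where the file name diabetes.csv is an assumption about how the download is saved.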

Methodology

The proposed approach first prepares the dataset to feed the model. Preprocessing cleans the dataset, which improves the efficiency of model training. Seven different algorithms are then compared to determine which performs best on this task.

Preprocessing
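The conclusion mentions substituting zero values with the mean of each attribute. A minimal sketch of that step is given here under two assumptions that the text does not confirm: which columns treat zero as a missing reading, and that the mean is taken over the non-zero entries only. The function name and the sample rows are hypothetical.

```python
from statistics import mean

# Assumption: columns where a zero reading is physiologically implausible.
# Pregnancies is left out because zero is a legitimate value there.
ZERO_AS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

def replace_zeros_with_mean(rows, columns=ZERO_AS_MISSING):
    """Return a copy of rows (list of dicts) with zeros replaced by the column mean.

    The mean is computed over the non-zero entries only (an assumption).
    """
    cleaned = [dict(row) for row in rows]
    for col in columns:
        observed = [row[col] for row in cleaned if row[col] != 0]
        if not observed:
            continue  # every value is zero, nothing to impute from
        col_mean = mean(observed)
        for row in cleaned:
            if row[col] == 0:
                row[col] = col_mean
    return cleaned

# Illustrative rows with only two of the columns present.
rows = [
    {"Glucose": 148, "BMI": 33.6},
    {"Glucose": 0, "BMI": 26.6},
    {"Glucose": 85, "BMI": 0},
]
cleaned = replace_zeros_with_mean(rows, columns=("Glucose", "BMI"))
print(cleaned)
```

The input rows are copied rather than modified, so the raw data stays available for comparison.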

Model Section

Logistic Regression
K Nearest Neighbors
Support Vector Machine
Naive Bayes
Decision Tree
Random Forest
XGBoost
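To give a feel for one of the listed models, K Nearest Neighbors can be sketched in plain Python: a sample is assigned the majority label among its k closest training points. This is an illustration only, not the notebook's code. The value k=3, the Euclidean distance, the two features, and the toy data are all assumptions, and in practice the features would usually be scaled first.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy training set: (Glucose, BMI) pairs with made-up values and labels.
train_X = [(85, 26.6), (89, 28.1), (92, 24.0), (148, 33.6), (183, 23.3), (168, 38.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (160, 35.0)))
print(knn_predict(train_X, train_y, (90, 27.0)))
```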

Results and Evaluation

The confusion matrix evaluates the performance of each model by dividing the test samples into four categories: True Positives, True Negatives, False Positives, and False Negatives. The sum of the four categories is ‘n’, the sample size of the test dataset. Each algorithm is evaluated on metrics that include precision, recall, f1-score, and support. The accuracies achieved are displayed in the table below:

Models	Accuracy
Logistic Regression	79.87%
K Nearest Neighbors	81.81%
Support Vector Machine	80.51%
Naive Bayes	80.51%
Decision Tree	69.48%
Random Forest	77.92%
XGBoost	78.57%
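The metrics named in this section follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic, treating '1' (diabetic) as the positive class. The function names and the example labels are hypothetical and are not taken from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn  # sample size of the test set
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for eight test samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

Support, the fourth quantity mentioned above, is the number of true samples in each class, here TP + FN for the positive class and TN + FP for the negative class.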

Conclusion

Reducing the features and substituting zero values with the mean of each attribute improved model training, which helped achieve better outcomes. Models built on the top four features reached accuracies of up to about 80%. K Nearest Neighbors achieved the highest accuracy of all, at 81.81%. There is still considerable room to improve the accuracy and precision of the model, which would help people learn early in life whether they are susceptible to diabetes. Diabetes must be detected in its early stages if it is to be treated properly and its complications curtailed. This work proposed a machine learning method for predicting diabetes. The method may also assist researchers in developing an accurate and useful tool that reaches physicians' desks and helps them make better decisions about a patient's diabetes status.