Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. The disease depends on other health factors such as glucose level, blood pressure, etc. The aim of this project is to predict the possibility of having diabetes (presently or in the near future) by analysing the statistics of these other health factors.
I have used a Machine Learning Model called 'KNN' (k-Nearest Neighbours) for predicting if a person has diabetes or not. The steps involved in reaching the final results are:
- Reading the dataset
- Extracting the useful information
- Cleaning the dataset
- Understanding the influence of each factor
- Dividing the dataset into train and test sets
- Creating the algorithm for prediction
- Making test predictions
- Calculating the accuracy of our Model
I have also made predictions using the model provided by sklearn, to compare the end results of both models.
Let's see what the KNN algorithm is.
KNN is a supervised machine learning algorithm, which relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.
The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.
So suppose we have a dataset of cells with two categories, Plant Cell and Animal Cell, and we are given a new unlabeled cell. Our task is to find out which category our 'new cell' belongs to.
Then we decide upon the value of 'K'; for now let's take it to be 5. So we will calculate the distance to the 5 nearest cells (the most common method is the Euclidean Distance) and simply pick the category with the most votes. Here the "new cell" will belong to the Animal Cell category.
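The Euclidean distance mentioned above can be computed in a single line with NumPy. The two feature vectors below are made up purely for illustration:

```python
import numpy as np

# Hypothetical feature vectors for the unlabeled "new cell" and one labeled cell
new_cell = np.array([2.0, 3.5])
labeled_cell = np.array([5.0, 7.5])

# Euclidean distance: square root of the sum of squared per-feature differences
distance = np.sqrt(np.sum((new_cell - labeled_cell) ** 2))
print(distance)  # 5.0
```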
1. Load the data and initialize K to your chosen number of neighbors
2. For each example in the data:
   2.1 Calculate the distance between the query example and the current example from the data.
   2.2 Add the distance and the index of the example to an ordered collection.
3. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
4. Pick the first K entries from the sorted collection
5. Get the labels of the selected K entries
6. If regression, return the mean of the K labels
7. If classification, return the mode of the K labels
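The steps above can be sketched in Python as follows. The tiny training set is invented purely for illustration; the real project applies the same idea to the diabetes features:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k, regression=False):
    # Step 2.1: calculate the distance between the query and every training example
    distances = [np.sqrt(np.sum((x - query) ** 2)) for x in X_train]
    # Steps 2.2 and 3: pair distances with indices, sorted ascending by distance
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    # Steps 4 and 5: pick the first K entries and get their labels
    k_labels = [y_train[i] for i in order[:k]]
    if regression:
        return sum(k_labels) / k                       # Step 6: mean of the K labels
    return Counter(k_labels).most_common(1)[0][0]      # Step 7: mode of the K labels

# Toy data: 0 = plant cell, 1 = animal cell (feature values are made up)
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_train = [0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, np.array([8.2, 8.4]), k=3))  # 1
```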
To select the K that’s right for your data, we run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors we encounter while maintaining the algorithm’s ability to accurately make predictions when it’s given data it hasn’t seen before.
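One way to carry out that search, sketched here with scikit-learn on a synthetic dataset (the actual project would loop over the diabetes train/test split instead):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the diabetes data, just to demonstrate the search
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Try several odd values of K and record the held-out accuracy of each
scores = {}
for k in range(1, 20, 2):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Odd values of K are preferred for binary classification so that votes cannot tie.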
Advantages:
- The algorithm is simple and easy to implement.
- There’s no need to build a model, tune several parameters, or make additional assumptions.
- The algorithm is versatile. It can be used for classification, regression, and search.

Disadvantages:
- The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
- with NaN values
- without NaN values
Pair Plots are a really simple (one-line-of-code simple!) way to visualize relationships between variables. They produce a matrix of pairwise relationships between the variables in your data for an instant examination. They can also be a great jumping-off point for deciding which types of regression analysis to use.
A heatmap is a graphical representation of two-dimensional data, using colors to demonstrate different factors. Heatmaps are a helpful visual aid for a viewer, enabling the quick dissemination of statistical or data-driven information.
Dataset: diabetes_dataset.csv
Source Code: diabetes_prediction.ipynb
Results: result.csv
Readme File: README.md
Contribution File: CONTRIBUTION.md
To test this project on your local computer, follow the given steps:
1. Fork this repository
2. Clone it
3. Make sure you have all the Prerequisites mentioned below
4. Run the diabetes_prediction.ipynb file
Make sure you have the latest version of Python 3; if not, you can easily download it from here.
Make sure to update pip to the latest version using `python -m pip install --upgrade pip`.
The project uses a few Python libraries, so make sure you have them too:
- numpy: download it using this documentation.
- pandas: download it using this documentation.
- matplotlib: download it using this documentation.
- scikit-learn: download it using this documentation.
- seaborn: download it using this documentation.
The KNN algorithm which we implemented had an accuracy of 73.37%. The KNN algorithm provided by sklearn had an accuracy of 75.32%.
To make the KNN algorithm more accurate, we can experiment with the value of 'K'.
If you do not wish to use kNN, you can always go for more accurate Machine Learning Models such as Vector Quantization, Naive Bayes, Support Vector Machines, etc. I will surely try to solve this problem using different algorithms to show the difference.
If you are curious about kNN algorithms, you can learn more from StatQuest
I would love to receive your contributions towards this project. Refer to CONTRIBUTION.md for more details.