This code implements a K-Nearest Neighbors (KNN) classifier to predict the onset of diabetes based on several medical predictors. The dataset used for this analysis is sourced from the Pima Indians Diabetes Database. This project includes data preprocessing, the implementation of a custom KNN classifier, model evaluation, and visualization of results.
- numpy - Numerical computing library.
- pandas - Data manipulation and analysis library.
- scikit-learn - Machine learning library for Python.
- matplotlib - Plotting library.
- seaborn - Data visualization library.
- Handling Zero Values: In the columns Glucose, BloodPressure, SkinThickness, BMI, and Insulin, a zero is not a plausible measurement and is treated as a missing value. Zeros in these columns are replaced with the column mean.
- Train-Test Split: The data is split into training and testing sets using an 80-20 split.
- Euclidean Distance Calculation: A custom function calculates the Euclidean distance between instances.
- Prediction: For each test instance, the K nearest neighbors are found, and the majority class among these neighbors is used as the predicted label.
- Scatter Plot: Visualization of Glucose vs. BloodPressure for the test set, with different colors indicating the class.
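
The preprocessing and prediction steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the default `k=3`, and the seed are hypothetical. It assumes the replacement mean is taken over the non-zero entries of each column, and it uses a plain NumPy shuffle as a stand-in for whatever 80-20 split routine the project uses.

```python
from collections import Counter

import numpy as np


def replace_zeros_with_mean(X, columns):
    """Return a float copy of X in which zeros in the given column indices
    are replaced by the mean of that column's non-zero entries (assumption)."""
    X = X.astype(float)  # astype returns a new array, so the input is untouched
    for col in columns:
        nonzero = X[:, col] != 0
        if nonzero.any():
            X[~nonzero, col] = X[nonzero, col].mean()
    return X


def split_80_20(X, y, seed=0):
    """Shuffle the rows, then use the first 80% for training and the rest for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = (len(X) * 4) // 5
    train, test = order[:cut], order[cut:]
    return X[train], X[test], y[train], y[test]


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def knn_predict(X_train, y_train, X_test, k=3):
    """Label each test row with the majority class among its k nearest training rows."""
    predictions = []
    for row in X_test:
        distances = [euclidean_distance(row, other) for other in X_train]
        nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


if __name__ == "__main__":
    # Made-up toy data with two well-separated clusters, for illustration only.
    X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
    y_toy = np.array([0, 0, 0, 1, 1, 1])
    queries = np.array([[0.5, 0.5], [10.5, 10.5]])
    print(knn_predict(X_toy, y_toy, queries, k=3))
```

With two classes, an odd `k` avoids tied votes. With an even `k`, this sketch resolves a tie in favor of the class encountered first among the nearest neighbors.
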
The Pima Indians Diabetes Database contains medical records for women of Pima Indian heritage who are at least 21 years old. Each record has 8 predictor features and a binary target variable, Outcome, indicating whether the patient has diabetes.
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration (2 hours in an oral glucose tolerance test)
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skinfold thickness (mm)
- Insulin: 2-hour serum insulin (μU/mL)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1); 268 of the 768 records are 1 (diabetes) and the remaining 500 are 0
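
The class counts in the Outcome entry imply an imbalanced dataset, which matters when reading an accuracy score. The short calculation below works out the positive rate and the accuracy of a trivial classifier that always predicts 0. The variable names are illustrative, and the figures describe the full dataset rather than any particular test split.

```python
# Class counts taken from the Outcome description above.
total_records = 768
positive = 268                      # Outcome == 1
negative = total_records - positive # Outcome == 0

positive_rate = positive / total_records

# A classifier that always predicts the majority class (0) is right
# exactly as often as that class occurs.
baseline_accuracy = max(positive, negative) / total_records

print(f"negative records:  {negative}")
print(f"positive rate:     {positive_rate:.1%}")
print(f"majority baseline: {baseline_accuracy:.1%}")
```

A KNN model on this data is only informative to the extent that it beats this roughly 65% majority-class baseline.
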