This repository contains an analysis of the Vertebral Column Data Set, a biomedical dataset that classifies patients into Normal (NO) and Abnormal (AB) categories. The binary classification task aims to distinguish between NO (0) and AB (1) using the K-Nearest Neighbors (KNN) algorithm.
The Vertebral Column Data Set contains six biomechanical attributes derived from the pelvis and lumbar spine. The dataset is available from the UCI Machine Learning Repository (Vertebral Column Data Set).
Visualize the data by creating scatterplots of the independent variables, using color to distinguish between classes 0 and 1.
Generate boxplots for each independent variable, using color to differentiate between classes 0 and 1.
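Below is a minimal plotting sketch with pandas and matplotlib. The file name `column_2C.dat`, the whitespace separator, and the column names are assumptions about the local copy of the data; adjust them to match yours.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name, separator, and column names -- adjust to your copy of the data.
cols = ["pelvic_incidence", "pelvic_tilt", "lumbar_lordosis_angle",
        "sacral_slope", "pelvic_radius", "degree_spondylolisthesis", "class"]
df = pd.read_csv("column_2C.dat", sep=r"\s+", names=cols)
df["label"] = (df["class"] == "AB").astype(int)   # AB -> 1, NO -> 0

features = cols[:-1]
colors = df["label"].map({0: "tab:blue", 1: "tab:red"}).to_numpy()

# Pairwise scatterplots of the independent variables, colored by class
pd.plotting.scatter_matrix(df[features], c=colors, figsize=(12, 12), diagonal="hist")
plt.suptitle("Scatterplots of the biomechanical attributes (blue = NO, red = AB)")
plt.show()

# Boxplots of each independent variable, grouped by class
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for ax, feat in zip(axes.ravel(), features):
    df.boxplot(column=feat, by="label", ax=ax)
    ax.set_xlabel("class (0 = NO, 1 = AB)")
plt.tight_layout()
plt.show()
```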
Split the data into a training set and a test set. Select the first 70 rows of Class 0 and the first 140 rows of Class 1 as the training set, with the remaining data as the test set.
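A sketch of this split, under the same loading assumptions (file name, separator, column names) as the snippet above:

```python
import pandas as pd

cols = ["pelvic_incidence", "pelvic_tilt", "lumbar_lordosis_angle",
        "sacral_slope", "pelvic_radius", "degree_spondylolisthesis", "class"]
df = pd.read_csv("column_2C.dat", sep=r"\s+", names=cols)
df["label"] = (df["class"] == "AB").astype(int)   # AB -> 1, NO -> 0
features = cols[:-1]

# First 70 Class-0 rows and first 140 Class-1 rows -> training set;
# all remaining rows -> test set.
class0, class1 = df[df["label"] == 0], df[df["label"] == 1]
train = pd.concat([class0.iloc[:70], class1.iloc[:140]])
test = pd.concat([class0.iloc[70:], class1.iloc[140:]])

X_train, y_train = train[features].to_numpy(), train["label"].to_numpy()
X_test, y_test = test[features].to_numpy(), test["label"].to_numpy()
```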
Implement KNN using the Euclidean metric for distance calculation.
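One possible minimal implementation with NumPy is sketched below; ties in the majority vote are broken toward the lower label by `np.bincount`, which is just one convention. scikit-learn's `KNeighborsClassifier(metric="euclidean")` would be an off-the-shelf alternative.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict labels for X_query by a k-nearest-neighbor majority vote,
    using the Euclidean metric for all distance calculations."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=int)
    preds = []
    for x in np.asarray(X_query, dtype=float):
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))     # Euclidean distances
        nearest = np.argsort(dists)[:k]                        # indices of the k closest points
        preds.append(np.bincount(y_train[nearest]).argmax())   # majority vote
    return np.array(preds)

# Example call, reusing the split from the previous snippet:
# y_pred = knn_predict(X_train, y_train, X_test, k=5)
```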
- Test all points in the test set with various values of `k`, making decisions by majority polling. Plot the train and test errors for different `k` values in reverse order, e.g., from 208 down to 1, with smaller increments if needed (see the first sketch after this list).
- Determine the most suitable `k` (denoted `k*`) based on the train and test errors. Compute the confusion matrix, true positive rate, true negative rate, precision, and F1-score for `k = k*`.
- Optimize the best test error rate using subsets of the training set. Plot the best test error rate against the training set size `N`, with `N` ranging from 10 to 210. For each `N`, select the optimal `k` from a set starting at `k = 1` and increasing in steps of 5 (see the second sketch after this list).
- Create a "Learning Curve" to visualize the relationship between training set size and the best test error rate.
- You can explore different variants of the KNN algorithm to enhance classification performance.
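The sketch below illustrates the `k` sweep and the metrics at `k = k*`, reusing `X_train`, `y_train`, `X_test`, `y_test`, and `knn_predict()` from the snippets above. The grid of `k` values (208 down to 1 in steps of 3) and the rule of picking `k*` as the `k` with the lowest test error are assumptions; substitute your own grid and selection rule.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed grid: k = 208, 205, ..., 4, 1; any decreasing grid over the range works.
ks = list(range(208, 0, -3))
train_err, test_err = [], []
for k in ks:
    train_err.append(np.mean(knn_predict(X_train, y_train, X_train, k) != y_train))
    test_err.append(np.mean(knn_predict(X_train, y_train, X_test, k) != y_test))

plt.plot(ks, train_err, label="train error")
plt.plot(ks, test_err, label="test error")
plt.gca().invert_xaxis()                 # show k in decreasing order
plt.xlabel("k"); plt.ylabel("error rate"); plt.legend(); plt.show()

# One common choice: k* is the k with the lowest test error.
k_star = ks[int(np.argmin(test_err))]
y_pred = knn_predict(X_train, y_train, X_test, k_star)

tp = np.sum((y_pred == 1) & (y_test == 1))
tn = np.sum((y_pred == 0) & (y_test == 0))
fp = np.sum((y_pred == 1) & (y_test == 0))
fn = np.sum((y_pred == 0) & (y_test == 1))

confusion = np.array([[tn, fp], [fn, tp]])   # rows: true class, columns: predicted class
tpr = tp / (tp + fn)                         # true positive rate (sensitivity)
tnr = tn / (tn + fp)                         # true negative rate (specificity)
precision = tp / (tp + fp)
f1 = 2 * precision * tpr / (precision + tpr)
print(f"k* = {k_star}\n{confusion}")
print(f"TPR={tpr:.3f} TNR={tnr:.3f} precision={precision:.3f} F1={f1:.3f}")
```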
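A second sketch draws the learning curve. The list above does not specify how the `N`-row training subset should be chosen; this version keeps the full training set's 1:2 class ratio (roughly `N/3` Class-0 rows and the rest Class-1), which is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Reuses X_train, y_train, X_test, y_test, and knn_predict() from above.
sizes = list(range(10, 211, 10))
best_test_err = []
for N in sizes:
    n0 = N // 3                                   # assumed: keep the 1:2 class ratio
    idx0 = np.where(y_train == 0)[0][:n0]
    idx1 = np.where(y_train == 1)[0][:N - n0]
    idx = np.concatenate([idx0, idx1])
    X_sub, y_sub = X_train[idx], y_train[idx]

    # For this N, try k = 1, 6, 11, ... (step 5, capped at the subset size)
    # and keep the best test error.
    errs = [np.mean(knn_predict(X_sub, y_sub, X_test, k) != y_test)
            for k in range(1, len(y_sub) + 1, 5)]
    best_test_err.append(min(errs))

plt.plot(sizes, best_test_err, marker="o")
plt.xlabel("training set size N")
plt.ylabel("best test error rate")
plt.title("Learning curve")
plt.show()
```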