This module introduced both the K Nearest Neighbors model as well as a variety of different metrics for classification. It is important to select and understand the appropriate metric for your task. This exercise is meant to get practice considering the difference between these new classification metrics and accompanying evaluation tools. Specifically, explore datasets related to business from the UCI Machine Learning Repository here.
Select a dataset of interest and clearly state the classification task. Specifically, describe a business problem that could be solved using the dataset and a KNN classification model. Further, identify what you believe to be the appropriate metric and justify your choice. Build a basic model with the KNearestNeighbor and grid search to optimize towards your chosen metric. Share your results with your peers.
After reviewing all the datasets, I decided to analyze the credit approval dataset.
Per the UCI Repository:
This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.
Here's the link to the dataset: https://archive.ics.uci.edu/dataset/27/credit+approval
The dataset has the following information:
name  role     type         demographic  description  units  missing_values
A16   Target   Categorical  None         None         None   no
A15   Feature  Continuous   None         None         None   no
A14   Feature  Continuous   None         None         None   yes
A13   Feature  Categorical  None         None         None   no
A12   Feature  Categorical  None         None         None   no
A11   Feature  Continuous   None         None         None   no
A10   Feature  Categorical  None         None         None   no
A9    Feature  Categorical  None         None         None   no
A8    Feature  Continuous   None         None         None   no
A7    Feature  Categorical  None         None         None   yes
A6    Feature  Categorical  None         None         None   yes
A5    Feature  Categorical  None         None         None   yes
A4    Feature  Categorical  None         None         None   yes
A3    Feature  Continuous   None         None         None   no
A2    Feature  Continuous   None         None         None   yes
A1    Feature  Categorical  None         None         None   yes
Sample data from the dataframe:
There are 690 records in total, some with missing values. After dropping the rows with missing data, the final dataset for processing contained 653 rows.
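In the raw UCI file, missing entries are encoded as the '?' character. A minimal sketch of the cleaning step, using a small illustrative frame rather than the real data (the column names A1/A2/A16 follow the dataset's naming scheme):

```python
import numpy as np
import pandas as pd

# Illustrative frame mimicking the raw file, where '?' marks a missing value.
df = pd.DataFrame({
    "A1": ["b", "a", "?", "b"],
    "A2": ["30.83", "58.67", "24.50", "?"],
    "A16": ["+", "-", "+", "-"],
})

# Replace the '?' sentinel with NaN, coerce numeric columns, then drop
# any row that still contains a missing value (690 -> 653 on the real data).
df = df.replace("?", np.nan)
df["A2"] = pd.to_numeric(df["A2"])
clean = df.dropna()
print(clean.shape)  # (2, 3): two of the four illustrative rows survive
```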
Plotting the numerical data, we find that it is mostly right-skewed.
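The right skew visible in the plots can also be confirmed numerically with pandas' `skew()`, where a positive value indicates a long right tail; a quick sketch on illustrative numbers:

```python
import pandas as pd

# Illustrative right-skewed column: most values small, one long right tail.
s = pd.Series([1, 1, 2, 2, 3, 3, 4, 50])
print(s.skew())  # positive => right-skewed
```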
The target variable is split at roughly 45% positive (approved) and 55% negative (denied) credit decisions.
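A sketch of how that class balance can be computed (the '+'/'-' labels match the dataset's A16 target encoding; the counts here are illustrative):

```python
import pandas as pd

# Illustrative target column; '+' = approved, '-' = denied.
y = pd.Series(["+"] * 45 + ["-"] * 55)

# Normalized value counts give the class proportions directly.
balance = y.value_counts(normalize=True)
print(balance)  # '-': 0.55, '+': 0.45
```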
Pearson's Correlation on the numerical dataset gives:
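Since the correlation output itself is not reproduced in the text, here is a minimal sketch of computing it; the columns are synthetic stand-ins for the continuous attributes (A2, A3, A8), not the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the continuous features; A8 is built to
# correlate positively with A2 so the matrix has some structure.
num = pd.DataFrame({
    "A2": rng.exponential(2.0, 100),
    "A3": rng.exponential(5.0, 100),
})
num["A8"] = num["A2"] * 0.5 + rng.normal(0, 0.5, 100)

corr = num.corr(method="pearson")
print(corr)
```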
The exercise is to perform the analysis with KNeighborsClassifier. I created a pipeline and ran the model against a few hyperparameters.
The pipeline is as shown below:
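Since the pipeline itself isn't reproduced in the text, here is a minimal sketch of a comparable preprocessing + KNN pipeline. The scaler, encoder, and column choices are my assumptions (KNN is distance-based, so scaling the numeric features matters), and the data below is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 120

# Synthetic stand-in for the cleaned credit data.
X = pd.DataFrame({
    "A2": rng.exponential(2.0, n),    # continuous
    "A3": rng.exponential(5.0, n),    # continuous
    "A9": rng.choice(["t", "f"], n),  # categorical
})
y = rng.choice(["+", "-"], n)

# Scale numeric features and one-hot encode the categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["A2", "A3"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["A9"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=12)),
])
pipe.fit(X, y)
acc = pipe.score(X, y)
print(f"training accuracy: {acc:.2f}")
```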
I ran the KNeighborsClassifier with n_neighbors = 5 (the default value), 12, 20, and 27.
The model accuracy is 80.15% for k=5 and 87.02% for k=12. This is clearly evident from the confusion matrix.
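A sketch of how the per-k accuracy and confusion-matrix comparison can be run; the data here is synthetic, so the numbers will not match the 80.15% / 87.02% figures from the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data as a stand-in.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for k in (5, 12, 20, 27):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    pred = knn.predict(X_te)
    results[k] = accuracy_score(y_te, pred)
    # ravel() flattens the confusion matrix to (TN, FP, FN, TP).
    print(k, confusion_matrix(y_te, pred).ravel(), f"acc={results[k]:.3f}")
```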
The precision-recall curve and the ROC curve for n_neighbors = 12 are shown below.
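Both curves are built from the classifier's predicted probabilities rather than its hard labels; a sketch of computing the underlying arrays and the ROC AUC at k=12 (again on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=12).fit(X_tr, y_tr)
scores = knn.predict_proba(X_te)[:, 1]  # probability of the positive class

# Threshold sweeps behind the two curves.
precision, recall, _ = precision_recall_curve(y_te, scores)
fpr, tpr, _ = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)
print(f"ROC AUC at k=12: {roc_auc:.3f}")
```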
The summary view of all the ROC curves is as shown below.
The KNeighborsClassifier model with 12 nearest neighbors produced the highest accuracy among the values I tried by hand; however, GridSearchCV selected 27 nearest neighbors. As the number of neighbors increases, the model reduces the false positives and false negatives, peaking at k=12 and k=27.
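A sketch of the grid search over the same candidate k values (synthetic data, so the selected k may differ from the 27 found on the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Cross-validated search over the same n_neighbors candidates,
# optimizing the chosen metric (accuracy here).
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 12, 20, 27]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, f"cv accuracy: {grid.best_score_:.3f}")
```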
Knowing more about the data would help the analyst / data scientist expand the model and handle the missing values more carefully. Because the attributes in this dataset are deliberately anonymized and thinly documented, any further analysis of the missing data is limited.