
UCI Data - Credit Approval - kNN Classifier Model


Problem Statement

This module introduced both the K Nearest Neighbors model and a variety of metrics for classification. It is important to select and understand the appropriate metric for your task. This exercise is meant to provide practice considering the differences between these classification metrics and the accompanying evaluation tools. Specifically, explore business-related datasets from the UCI Machine Learning Repository.

Select a dataset of interest and clearly state the classification task. Specifically, describe a business problem that could be solved using the dataset and a kNN classification model. Further, identify what you believe to be the appropriate metric and justify your choice. Build a basic model with KNeighborsClassifier and use grid search to optimize toward your chosen metric. Share your results with your peers.

Credit Approval Dataset

After reviewing all the datasets, I decided to analyze the credit approval dataset.

Per the UCI Repository:

This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.

This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

Here's the link to the dataset: https://archive.ics.uci.edu/dataset/27/credit+approval


The dataset has the following information:

   name     role         type demographic description units missing_values
0   A16   Target  Categorical        None        None  None             no
1   A15  Feature   Continuous        None        None  None             no
2   A14  Feature   Continuous        None        None  None            yes
3   A13  Feature  Categorical        None        None  None             no
4   A12  Feature  Categorical        None        None  None             no
5   A11  Feature   Continuous        None        None  None             no
6   A10  Feature  Categorical        None        None  None             no
7    A9  Feature  Categorical        None        None  None             no
8    A8  Feature   Continuous        None        None  None             no
9    A7  Feature  Categorical        None        None  None            yes
10   A6  Feature  Categorical        None        None  None            yes
11   A5  Feature  Categorical        None        None  None            yes
12   A4  Feature  Categorical        None        None  None            yes
13   A3  Feature   Continuous        None        None  None             no
14   A2  Feature   Continuous        None        None  None            yes
15   A1  Feature  Categorical        None        None  None            yes

Sample rows from the dataframe are shown in the notebook.

Exploratory Data Analysis (EDA)

There are 690 records in total, some with missing values. After removing the records with missing data, the final dataset contained 653 rows.
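In the raw UCI file, missing values are encoded as '?'. A minimal sketch of that cleanup step, using a toy frame with made-up values (the real A1–A15 attributes are anonymized, so the column names and values here are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the credit approval data; values are invented.
df = pd.DataFrame({
    "A1": ["b", "a", "?", "b"],
    "A2": ["30.83", "58.67", "24.50", "?"],
    "A16": ["+", "+", "-", "-"],
})

# Replace the '?' placeholder with NaN, coerce the numeric column,
# then drop any row that still has a missing value.
df = df.replace("?", np.nan)
df["A2"] = pd.to_numeric(df["A2"])
clean = df.dropna().reset_index(drop=True)
print(len(clean))  # 2 complete rows survive
```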

Plotting the numerical columns as histograms shows that the data is mostly right-skewed.

The target variable is split at roughly 45% positive and 55% negative credit approvals.

Pearson's correlation was computed on the numerical columns; the correlation plot is included in the notebook.

Model Selection and Validation

The exercise is to use KNeighborsClassifier for the analysis. I created a pipeline and ran the model with a few hyperparameters.

The pipeline is shown in the notebook.
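A hedged sketch of what such a preprocessing-plus-kNN pipeline might look like in scikit-learn; the column names, values, and transformer choices below are assumptions, not the notebook's exact setup:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative stand-ins for the anonymized A1-A15 features.
X = pd.DataFrame({
    "A2": [30.8, 58.7, 24.5, 27.8, 20.2, 32.1],   # continuous
    "A4": ["u", "y", "u", "u", "y", "u"],          # categorical
})
y = ["+", "+", "-", "-", "+", "-"]

# Scale numeric columns and one-hot encode categorical ones,
# then feed the transformed features into kNN.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["A2"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["A4"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```

Scaling matters here because kNN is distance-based: unscaled continuous features would dominate the one-hot encoded categorical features.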

I ran KNeighborsClassifier with n_neighbors = 5 (the default), 12, 20, and 27.

The model accuracy for k=5 is 80.15% and for k=12 is 87.02%. This is also evident from the confusion matrices in the notebook.

The precision-recall curve and the ROC curve for n_neighbors = 12 are included in the notebook.
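These curves can be computed with scikit-learn's metrics module; the labels and scores below are made up purely to illustrate the calls (in practice the scores would come from the fitted model's predict_proba):

```python
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Invented labels and scores standing in for y_test and predict_proba[:, 1].
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]

# Points for the precision-recall curve and the ROC curve.
prec, rec, pr_thresh = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresh = roc_curve(y_true, y_score)

# Area under the ROC curve summarizes the ranking quality in one number.
print(round(roc_auc_score(y_true, y_score), 3))  # 0.75 for this toy data
```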

A summary view of all the ROC curves is also included in the notebook.

Conclusion

The kNN classifier with 12 nearest neighbors produced the highest model accuracy. However, GridSearchCV selected 27 nearest neighbors as the best parameter. As the number of neighbors increases, the model reduces false positives and false negatives, peaking at k = 12 and k = 27.
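A sketch of how such a grid search over n_neighbors could be set up with GridSearchCV; the synthetic data, seed, and scoring choice here are placeholders for the notebook's actual setup:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the cleaned credit features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Cross-validated search over the same k values tried above.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 12, 20, 27]},
    scoring="accuracy",   # swap in the metric chosen for the business problem
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

In a real run the estimator would be the full preprocessing pipeline, with the parameter name prefixed accordingly (e.g. `knn__n_neighbors`).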

Opportunities

Learning more about the data would help the analyst or data scientist expand the model and handle the missing data more thoughtfully. Because this dataset provides little documentation about its attributes, further analysis of the missing values is limited.