KNN-Implementation-Numpy

KNN / weighted-KNN implementation with numpy

BBM409 Introduction to Machine Learning Lab. Fall 2022.

Assignment 1: PART 1 : Personality Classification

Contributors:

Ali Argun Sayilgan : 21827775

Mehmet Giray Nacakci : 21989009

Please run this report with "RUN ALL" command

Dataset

The dataset consists of 10,000 responses to the 60 questions of the 16 Personality Test, together with their ground truth labels (Personality Types).
Answers to the questions are encoded as follows:

Fully Agree: 3
Partially Agree: 2
Slightly Agree: 1
Neutral: 0
Slightly Disagree: -1
Partially Disagree: -2
Fully Disagree: -3

import pandas as pd
pd.set_option('display.precision', 6)
import numpy as np

df = pd.read_csv("subset_16P.csv", encoding='cp1252')
df.head()
|  | Response Id | You regularly make new friends. | You spend a lot of your free time exploring various random topics that pique your interest | Seeing other people cry can easily make you feel like you want to cry too | You often make a backup plan for a backup plan. | You usually stay calm, even under a lot of pressure | At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know | You prefer to completely finish one project before starting another. | You are very sentimental. | You like to use organizing tools like schedules and lists. | ... | You believe that pondering abstract philosophical questions is a waste of time. | You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places. | You know at first glance how someone is feeling. | You often feel overwhelmed. | You complete things methodically without skipping over any steps. | You are very intrigued by things labeled as controversial. | You would pass along a good opportunity if you thought someone else needed it more. | You struggle with deadlines. | You feel confident that things will work out for you. | Personality |
| 0 | 35874 | -1 | 0 | -1 | 1 | -1 | -2 | -2 | 0 | -1 | ... | 0 | 3 | 0 | 0 | 0 | 0 | 1 | -1 | 0 | ENTP |
| 1 | 42624 | 0 | 0 | 1 | 0 | 0 | 0 | -1 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | -1 | -3 | 2 | INTP |
| 2 | 55199 | 0 | 0 | -2 | -1 | 2 | -2 | 0 | 0 | -1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | ESTP |
| 3 | 52983 | 0 | 0 | 0 | 1 | -2 | -1 | 0 | 0 | 1 | ... | 1 | 1 | 0 | -1 | 0 | -1 | 2 | -2 | 0 | ENTP |
| 4 | 22864 | 0 | 0 | 2 | 1 | 0 | -2 | -1 | 0 | 1 | ... | 1 | -2 | 0 | 1 | 0 | 0 | 0 | -2 | 2 | ENFJ |

5 rows × 62 columns

Preprocessing the dataset

To use the string ground truth labels more effectively in our ML algorithms, we convert them to integers ranging from 0 to 15.

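numberOfClasses = len(df["Personality"].unique())
personality_types =[ "ESTJ", "ENTJ", "ESFJ", "ENFJ", "ISTJ", "ISFJ",
"INTJ", "INFJ", "ESTP", "ESFP", "ENTP", "ENFP",
"ISTP", "ISFP", "INTP", "INFP" ]
# Note: the second positional argument of astype() is `copy`, not a category list,
# so personality_types does not set the code order here; pandas assigns the codes
# by the sorted (alphabetical) order of the labels.
df.Personality = df.Personality.astype("category", personality_types).cat.codes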
df.Personality.describe()
count    10000.000000
mean         7.500200
std          4.621811
min          0.000000
25%          3.000000
50%          8.000000
75%         12.000000
max         15.000000
Name: Personality, dtype: float64
df.Personality.value_counts()
9     662
3     641
15    641
7     640
0     632
1     629
14    625
8     625
11    624
12    624
2     622
4     620
13    616
5     615
10    596
6     588
Name: Personality, dtype: int64
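
The label-to-integer mapping can also be pinned to an explicit order. The snippet below is a minimal sketch, assuming a small hypothetical `labels` Series instead of the real dataset; `pd.Categorical` with a `categories` list assigns each label the position it has in that list, so the codes can be decoded back later.

```python
import pandas as pd

# Hypothetical toy labels; the notebook itself reads them from subset_16P.csv.
labels = pd.Series(["INTP", "ESTJ", "ENFJ", "ESTJ"])

# Explicit category order: the code of a label is its index in this list.
personality_order = ["ESTJ", "ENTJ", "ESFJ", "ENFJ", "ISTJ", "ISFJ",
                     "INTJ", "INFJ", "ESTP", "ESFP", "ENTP", "ENFP",
                     "ISTP", "ISFP", "INTP", "INFP"]

codes = pd.Categorical(labels, categories=personality_order).codes

# Decoding is a plain list lookup because the order is known.
decoded = [personality_order[c] for c in codes]

print(list(codes))
print(decoded)
```

With an explicit order, a predicted integer can always be translated back to its personality type, which helps when inspecting misclassified samples.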

Number of occurrences of each class type

image.png

As can be seen, there isn't much of an imbalance between the classes in the dataset.

Next, we split the dataset into X (question answers) and y (ground truth Personality labels).

X = df.drop(['Response Id','Personality'], axis=1)
y = df.Personality

X= X.to_numpy()
Y= y.to_numpy()

Feature Normalization

Some machine learning algorithms are sensitive to feature scaling while others are virtually invariant to it. KNN is distance-based, so without feature normalization the features with larger value ranges contribute more to the distance and therefore influence the prediction more than the others, which is generally not what we want.

df.describe()
|  | Response Id | You regularly make new friends. | You spend a lot of your free time exploring various random topics that pique your interest | Seeing other people cry can easily make you feel like you want to cry too | You often make a backup plan for a backup plan. | You usually stay calm, even under a lot of pressure | At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know | You prefer to completely finish one project before starting another. | You are very sentimental. | You like to use organizing tools like schedules and lists. | ... | You believe that pondering abstract philosophical questions is a waste of time. | You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places. | You know at first glance how someone is feeling. | You often feel overwhelmed. | You complete things methodically without skipping over any steps. | You are very intrigued by things labeled as controversial. | You would pass along a good opportunity if you thought someone else needed it more. | You struggle with deadlines. | You feel confident that things will work out for you. | Personality |
| count | 10000.000000 | 10000.00000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | ... | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.0000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 30033.526600 | -0.00420 | 0.002100 | 0.01470 | -0.211000 | -0.14970 | 0.012500 | -0.45950 | 0.002400 | 0.130300 | ... | 0.000700 | 0.123400 | -0.002900 | 0.258500 | -0.004600 | -0.002400 | 0.1192 | -0.027200 | 0.100300 | 7.500200 |
| std | 17310.103985 | 0.37013 | 0.370013 | 1.53796 | 1.523388 | 1.49416 | 1.514983 | 1.45278 | 0.362777 | 1.535629 | ... | 0.364572 | 1.528073 | 0.371087 | 1.495494 | 0.363857 | 0.368792 | 1.5250 | 1.531305 | 1.561885 | 4.621811 |
| min | 0.000000 | -1.00000 | -1.000000 | -3.00000 | -3.000000 | -3.00000 | -3.000000 | -3.00000 | -1.000000 | -3.000000 | ... | -1.000000 | -3.000000 | -1.000000 | -3.000000 | -1.000000 | -1.000000 | -3.0000 | -3.000000 | -3.000000 | 0.000000 |
| 25% | 15058.750000 | 0.00000 | 0.000000 | -1.00000 | -1.000000 | -1.00000 | -1.000000 | -2.00000 | 0.000000 | -1.000000 | ... | 0.000000 | -1.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | -1.0000 | -1.000000 | -1.000000 | 3.000000 |
| 50% | 29961.500000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | -1.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 8.000000 |
| 75% | 45206.750000 | 0.00000 | 0.000000 | 1.00000 | 1.000000 | 1.00000 | 1.000000 | 0.00000 | 0.000000 | 1.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.0000 | 1.000000 | 1.000000 | 12.000000 |
| max | 59997.000000 | 1.00000 | 1.000000 | 3.00000 | 3.000000 | 3.00000 | 3.000000 | 3.00000 | 1.000000 | 3.000000 | ... | 1.000000 | 3.000000 | 1.000000 | 3.000000 | 1.000000 | 1.000000 | 3.0000 | 3.000000 | 3.000000 | 15.000000 |

8 rows × 62 columns

As can be seen from the table, the distribution of answers differs from column to column. Even though possible answer scores range from -3 to 3, the answers to some questions only span -1 to 1. Thus, without feature normalization, the columns with wider ranges are going to influence the outcome more.
In this project, we will be using MinMaxScaler as our feature normalization algorithm.

MinMaxScaler

MinMaxScaler rescales the data into a given range, usually 0 to 1. In this project we rescale each column to the range 0 to 1 with the formula given below.

x_scaled = (x - min) / (max - min), where min and max are the minimum and maximum values of the column


Another important point is that scaling must not leak information from the test dataset. If you scale the test dataset with its own min and max values, statistics of the whole test dataset leak into the evaluation, which is bad practice. The min and max values must therefore come from the training dataset only, and the same values are then applied to the test dataset.
class MinMaxScaler():
    def __init__(self):
        self.mins = []
        self.maxes = []
        
    def fit_transform(self, X):
        self.mins = X.min(axis=0)
        self.maxes = X.max(axis=0)
        maxMinusMin = self.maxes - self.mins
        return (X - self.mins) / maxMinusMin
    
    
    def transform(self, X):
        maxMinusMin = self.maxes - self.mins
        return (X - self.mins) / maxMinusMin
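
The leakage rule above can be illustrated on toy data. This is a minimal sketch, not the scaler used in the experiments: the helper names `fit_min_max` and `apply_min_max` are hypothetical, and mapping a constant column to 0 instead of dividing by zero is an assumption added for the example.

```python
import numpy as np

def fit_min_max(X_train):
    """Learn per-column minimum and range from the training split only."""
    mins = X_train.min(axis=0)
    ranges = X_train.max(axis=0) - mins
    # Assumption: a constant column (range 0) is divided by 1 so it maps to 0
    # instead of producing a division by zero.
    ranges = np.where(ranges == 0, 1.0, ranges)
    return mins, ranges

def apply_min_max(X, mins, ranges):
    """Scale any split with statistics learned from the training split."""
    return (X - mins) / ranges

# Toy data: the second training column is constant.
X_train = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 0.0]])
X_test = np.array([[1.5, 0.0], [-3.0, 1.0]])

mins, ranges = fit_min_max(X_train)
train_scaled = apply_min_max(X_train, mins, ranges)
test_scaled = apply_min_max(X_test, mins, ranges)

print(train_scaled)
print(test_scaled)
```

Test values can fall outside the 0 to 1 range when the test split contains values the training split never saw; that is expected when the statistics come from the training data only.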

KFold

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows:

  1. Shuffle the dataset randomly.
  2. Split the dataset into k groups.
  3. For each unique group:
     a. Take the group as a hold-out or test data set.
     b. Take the remaining groups as a training data set.
     c. Fit a model on the training set and evaluate it on the test set.
     d. Retain the evaluation score and discard the model.
  4. Summarize the skill of the model using the sample of model evaluation scores.

[source] (https://machinelearningmastery.com/k-fold-cross-validation/)
import random

class KFold():
    def __init__(self,n_splits=5, shuffle=True, random_state=42):
        self.shuffle = shuffle
        self.n_splits=n_splits
        self.random_state= random_state

    # Fisher-Yates Shuffle Algorithm
    def shuffler (self, arr, n):
        random.seed(n)
        rowSize = arr.shape[0]
        for i in range(rowSize-1,0,-1):
            
            # random index from 0 to i (random.randint is inclusive on both ends)
            j = random.randint(0, i)
            
            # Swap with random index
            arr[[i, j]] = arr[[j, i]]
        return arr


    def split(self, X, y):
        if(self.shuffle):
            # the same seed produces the same permutation for X and y
            X = self.shuffler(X, self.random_state)
            y = self.shuffler(y, self.random_state)

        rowSize = len(X)
        testSetSize = rowSize // self.n_splits
        for i in range(self.n_splits):
            start = i*testSetSize
            # because we calculate testSetSize with //,
            # the last split must run through the end of the whole array
            end = (i+1)*testSetSize if i != self.n_splits-1 else rowSize

            x_test = X[start:end]
            y_test = y[start:end]

            # training data: rows before the test fold followed by rows after it
            x_train = np.append(X[:start], X[end:], axis=0)
            y_train = np.append(y[:start], y[end:], axis=0)
            yield (x_train, x_test, y_train, y_test)
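
The same procedure can also be expressed with index arrays, which leaves the original data untouched. The following is a sketch under stated assumptions rather than the class used in this report: the function name `kfold_indices` is hypothetical, the shuffle uses NumPy's `default_rng` instead of the Fisher-Yates loop, and `np.array_split` spreads leftover rows over the first folds instead of putting them all in the last one.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, test_idx); every sample is used as test data exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # one shuffle shared by X and y
    folds = np.array_split(order, n_splits)   # fold sizes differ by at most one
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

# Toy data with 11 rows, so the folds cannot all have the same size.
X_demo = np.arange(22).reshape(11, 2)
y_demo = np.arange(11)

splits = list(kfold_indices(len(X_demo), n_splits=5, seed=0))
for train_idx, test_idx in splits:
    x_train, x_test = X_demo[train_idx], X_demo[test_idx]
    y_train, y_test = y_demo[train_idx], y_demo[test_idx]
    print(len(train_idx), len(test_idx))
```

Indexing `X` and `y` with the same index arrays keeps features and labels aligned without shuffling either array in place.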

KNNClassifier

KNN is an instance-based learning method. Instance-based learning (memory-based learning, lazy learning) is a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory.

There are efficient implementations to store the data using complex data structures like k-d trees to make look-up and matching of new patterns during prediction more efficient. But in this project we will be making use of basic numpy arrays.

Prediction algorithm

Predictions are made for a new instance (x) by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable, in classification this might be the mode (or most common) class value.

To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. The most popular distance measures are Euclidean Distance, Manhattan Distance, Minkowski Distance, Hamming Distance. In this project we will be using Euclidean distance.

Euclidean distance formula:
d(a, b) = sqrt( (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2 )
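
To make the prediction step concrete, here is a small worked example on hypothetical toy data (four 2-D training points, two classes). It is only a sketch of the idea: it computes Euclidean distances to one query point, picks the K = 3 nearest neighbors, and compares a plain majority vote with a vote weighted by inverse distance.

```python
import numpy as np

# Hypothetical toy training set: two points of class 0, two of class 1.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 4.0]])
y_train = np.array([0, 0, 1, 1])
query = np.array([2.0, 0.0])

# Euclidean distance from the query to every training point.
distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))

k = 3
nearest = np.argsort(distances)[:k]   # indices of the k closest points

# Uniform vote: every neighbor counts once.
uniform_votes = np.bincount(y_train[nearest], minlength=2)
uniform_pred = int(np.argmax(uniform_votes))

# Distance-weighted vote: closer neighbors count more (weight = 1 / distance).
weighted_votes = np.bincount(y_train[nearest],
                             weights=1.0 / distances[nearest],
                             minlength=2)
weighted_pred = int(np.argmax(weighted_votes))

print(uniform_pred)    # 0: two of the three neighbors belong to class 0
print(weighted_pred)   # 1: the single class-1 neighbor is much closer
```

The two votes disagree here because the only class 1 neighbor lies at distance 1 while both class 0 neighbors are at distance 2 or more, which is exactly the situation where weighting changes the outcome.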

class KNNClassifier():
    def __init__(self, n_neighbors=5, weights='uniform', n_classes = 16):

        self.X_train = None
        self.y_train = None
        
        self.n_classes = n_classes
        self.n_neighbors = n_neighbors
        self.weights = weights

        
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def euclidian_distance(self, a, b):
        distances = np.sqrt(np.sum((a - b)**2, axis=1))
        # prevent division by zero
        distances[np.where(distances < 0.00001)] = 0.00001
        return distances


    def kneighbors(self, X_test, return_distance=False):

        dist = []
        neigh_ind = []

        point_dist = [self.euclidian_distance(x_test, self.X_train) for x_test in X_test]

        for row in point_dist:
            enum_neigh = enumerate(row)
            sorted_neigh = sorted(enum_neigh,
                                  key=lambda x: x[1])[:self.n_neighbors]

            ind_list = [tup[0] for tup in sorted_neigh]
            dist_list = [tup[1] for tup in sorted_neigh]

            dist.append(dist_list)
            neigh_ind.append(ind_list)

        if return_distance:
            return np.array(dist), np.array(neigh_ind)

        return np.array(neigh_ind)


    def predict(self, X_test):

        # non-weighted knn, majority voting of neighbors for classification
        if self.weights == 'uniform':
            neighbors = self.kneighbors(X_test)
            y_pred = np.array([
                np.argmax(np.bincount(self.y_train[neighbor]))
                for neighbor in neighbors
            ])
            return y_pred


        # weighted knn, voting based on weights of neighbors
        elif self.weights == 'distance':

            dist, neigh_ind = self.kneighbors(X_test, return_distance=True)

            inv_dist = 1 / dist

            mean_inv_dist = inv_dist / np.sum(inv_dist, axis=1)[:, np.newaxis]

            proba = []

            for i, row in enumerate(mean_inv_dist):

                row_pred = self.y_train[neigh_ind[i]]

                for k in range(self.n_classes):
                    indices = np.where(row_pred == k)
                    prob_ind = np.sum(row[indices])
                    proba.append(np.array(prob_ind))

            predict_proba = np.array(proba).reshape(X_test.shape[0],
                                                    self.n_classes)

            y_pred = np.array([np.argmax(item) for item in predict_proba])

            return y_pred


        # used for interpretation of misclassified samples, return also nearest neighbors
        elif self.weights == 'uniform_neighbors':
            neighbors = self.kneighbors(X_test)  # nearestNeighborsIndices_of_all_testSamples
            y_pred = np.array([
                np.argmax(np.bincount(self.y_train[neighbor]))
                for neighbor in neighbors
            ])
            return y_pred, neighbors
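
Sorting every distance row in Python is the main cost of the class above. As an optional alternative, the sketch below computes all test-to-train distances in one matrix expression and selects the k smallest per row with `np.argpartition`. The function names `pairwise_euclidean` and `k_nearest` are hypothetical and the demo data is made up; this is not the code that produced the results in this report.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Distances between every row of A (m, d) and every row of B (n, d)."""
    a_sq = np.sum(A ** 2, axis=1)[:, np.newaxis]   # shape (m, 1)
    b_sq = np.sum(B ** 2, axis=1)[np.newaxis, :]   # shape (1, n)
    squared = a_sq + b_sq - 2.0 * (A @ B.T)        # |a|^2 + |b|^2 - 2ab
    return np.sqrt(np.maximum(squared, 0.0))       # clip tiny negatives from rounding

def k_nearest(X_test, X_train, k):
    """Return (distances, indices) of the k nearest training rows, closest first."""
    dist = pairwise_euclidean(X_test, X_train)
    # argpartition places the k smallest entries of each row first, unordered.
    candidates = np.argpartition(dist, k - 1, axis=1)[:, :k]
    rows = np.arange(dist.shape[0])[:, np.newaxis]
    # Sort only those k candidates by distance.
    order = np.argsort(dist[rows, candidates], axis=1)
    neigh_ind = candidates[rows, order]
    return dist[rows, neigh_ind], neigh_ind

X_train_demo = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
X_test_demo = np.array([[0.0, 0.1], [4.0, 4.0]])

neigh_dist, neigh_ind = k_nearest(X_test_demo, X_train_demo, k=2)
print(neigh_ind)
print(neigh_dist)
```

The full distance matrix needs memory proportional to the number of test rows times the number of training rows, so for large folds it may have to be computed in batches.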

Pipeline

A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.

It allows the sequence of steps to be specified, evaluated, and used as an atomic unit. Like:

  1. [Input], [Normalization], [KNN Classifier], [Predictions]
  2. [Input], [Standardization], [RFE], [SVM], [Predictions]
# from scipy import stats

class Pipeline():
    def __init__(self, scaler=None, classifier=None):
        self.scaler = scaler
        self.classifier = classifier

    def execute(self,x_train, x_test, y_train):
        if(self.scaler is not None):
            x_train = self.scaler.fit_transform(x_train)
            x_test = self.scaler.transform(x_test)
        if(self.classifier is not None):
            self.classifier.fit(x_train, y_train)
            return self.classifier.predict(x_test)
            

Classification metrics

A classifier is only as good as the metric used to evaluate it.

If you choose the wrong metric to evaluate your models, you are likely to choose a poor model, or in the worst case, be misled about the expected performance of your model.

Choosing the right classification metric is particularly difficult for imbalanced classification problems: most of the widely used standard metrics assume a balanced class distribution, and typically not all classes, and therefore not all prediction errors, are equally important.

In this project we will be using Accuracy, Precision and Recall metrics to evaluate our ML models' predictions.

Accuracy

Accuracy = (number of correct predictions) / (total number of predictions)

def accuracy(pred, actual):
    return sum(pred == actual) / len(pred)

Precision

Precision = TP / (TP + FP)
Since there are 16 ground truth labels, we take the average precision of all labels.

def precision(pred, actual):
    if(len(pred) == 0 or len(pred) != len(actual)):
        return -1
    labels= []
    truePositivesPerLabel = {}
    falsePositivesPerLabel = {}
    precisionPerLabel = {}

    
    for i in range(len(pred)):
        prediction = pred[i]
        if prediction not in labels:
            labels.append(prediction)
            truePositivesPerLabel[prediction] = 0
            falsePositivesPerLabel[prediction] = 0
        
        if(pred[i] == actual[i]):
            truePositivesPerLabel[prediction] +=1
        else:
            falsePositivesPerLabel[prediction] +=1
    
    # number of labels that appear at least once among the predictions
    existedLabelCount = 0
    
    precisionSum = 0
    for label in labels:
        denominator = truePositivesPerLabel[label] + falsePositivesPerLabel[label]
        # guard against division by zero
        if(denominator > 0):
            existedLabelCount +=1
            precisionSum += truePositivesPerLabel[label] / denominator
    
    return precisionSum / existedLabelCount
        
        

Recall

Recall = TP / (TP + FN)
Since there are 16 ground truth labels, we take the average recall of all labels.

def recall(pred, actual):
    if(len(pred) == 0 or len(pred) != len(actual)):
        return -1
    labels= []
    truePositivesPerLabel = {}
    falseNegativesPerLabel = {}
    recallPerLabel = {}
    
    for i in range(len(actual)):
        actualClass = actual[i]
        if actualClass not in labels:
            labels.append(actualClass)
            truePositivesPerLabel[actualClass] = 0
            falseNegativesPerLabel[actualClass] = 0
        
        if(pred[i] == actual[i]):
            truePositivesPerLabel[actualClass] +=1
        else:
            falseNegativesPerLabel[actualClass] +=1
    
    # number of labels that appear at least once in the ground truth
    existedLabelCount = 0
    
    recallSum = 0
    for label in labels:
        denominator = truePositivesPerLabel[label] + falseNegativesPerLabel[label]
        # guard against division by zero
        if(denominator > 0):
            existedLabelCount +=1
            recallSum += truePositivesPerLabel[label] / denominator
    
    return recallSum / existedLabelCount
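
Both averages can also be read off a confusion matrix. The sketch below is an illustrative alternative, assuming integer class labels from 0 to `n_classes - 1`; the function names are hypothetical. It follows the same averaging rule as the functions above: precision is averaged over the classes that were predicted at least once, recall over the classes that occur in the ground truth.

```python
import numpy as np

def confusion_matrix(pred, actual, n_classes):
    """cm[i, j] counts the samples whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (actual, pred), 1)
    return cm

def macro_precision_recall(pred, actual, n_classes):
    cm = confusion_matrix(pred, actual, n_classes)
    tp = np.diag(cm)
    predicted_per_class = cm.sum(axis=0)   # TP + FP for each class
    actual_per_class = cm.sum(axis=1)      # TP + FN for each class
    # Skip classes with an empty denominator instead of dividing by zero.
    p_mask = predicted_per_class > 0
    r_mask = actual_per_class > 0
    macro_precision = np.mean(tp[p_mask] / predicted_per_class[p_mask])
    macro_recall = np.mean(tp[r_mask] / actual_per_class[r_mask])
    return macro_precision, macro_recall

# Hypothetical toy labels with three classes.
actual_demo = np.array([0, 0, 1, 1, 2, 2])
pred_demo = np.array([0, 1, 1, 1, 2, 0])

p, r = macro_precision_recall(pred_demo, actual_demo, n_classes=3)
print(p, r)
```

In the toy example the per-class precisions are 1/2, 2/3 and 1, and the per-class recalls are 1/2, 1 and 1/2, so the two averages differ even though they share the same true positives.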

Cross validation scores

When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation. A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following procedure is followed for each of the k “folds”:

  • A model is trained using k-1 of the folds as training data;
  • the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

grid_search_cross_validation.png

def cross_val_score(X, Y, cv, pipeline):

    accuracy_folds = []
    precision_folds = []
    recall_folds = []

    # for each fold of the k-fold validation
    for (x_train, x_test, y_train, y_test) in cv.split(X,Y): 
        y_pred = pipeline.execute(x_train, x_test, y_train)

        accuracy_folds.append(accuracy(y_pred, y_test))
        precision_folds.append(precision(y_pred, y_test))
        recall_folds.append(recall(y_pred, y_test))

    # averages of folds, appended as the last element of each list
    accuracy_folds.append(sum(accuracy_folds)/len(accuracy_folds))
    precision_folds.append(sum(precision_folds)/len(precision_folds))
    recall_folds.append(sum(recall_folds)/len(recall_folds))

    return accuracy_folds, precision_folds, recall_folds

Run non-weighted and weighted KNN models (20 variations)

Now we will be comparing our models' performance with/without feature normalization and with different k_neighbors values as KNNClassifier parameter.

cv = KFold(5, shuffle=True, random_state=24)
scaler = MinMaxScaler()
neighborVariations = [1,3,5,7,9]

accuracy_table_columns = []
precision_table_columns = []
recall_table_columns = []

import time

def run_all_models():
    print(" \nResults of 20 KNN model variations will be ready after ABOUT  25  MINUTES  of execution. Please wait... \n")
    progress = 1
    start = time.time()

    """  ***   NON-WEIGHTED KNN   *** """
    for k in neighborVariations:   # THIS LOOP TAKES ABOUT 10 MINUTES TO COMPLETE

        knnUniform = KNNClassifier(n_neighbors=k, weights='uniform', n_classes=numberOfClasses)

        # with feature normalization
        print("  KNN model variation no:   " + str(progress) + "  is started being processed..." )
        pipeline = Pipeline(scaler=scaler, classifier=knnUniform)
        accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
        accuracy_table_columns.append(accuracies)
        precision_table_columns.append(precisions)
        recall_table_columns.append(recalls)
        print("   KNN model variation no:  " + str(progress) + "  processing is finished.\n" )
        progress += 1

        # without feature normalization
        print("  KNN model variation no:   " + str(progress) + "  is started being processed..." )
        pipeline = Pipeline(classifier=knnUniform)
        accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
        accuracy_table_columns.append(accuracies)
        precision_table_columns.append(precisions)
        recall_table_columns.append(recalls)
        print("   KNN model variation no:  " + str(progress) + "  processing is finished.\n" )
        progress += 1


    """  ***   WEIGHTED KNN   *** """
    for k in neighborVariations:   # THIS LOOP TAKES ABOUT 15 MINUTES TO COMPLETE

        knnDistance = KNNClassifier(n_neighbors=k, weights='distance', n_classes=numberOfClasses)

        # with feature normalization
        print("  KNN model variation no:   " + str(progress) + "  is started being processed..." )
        pipeline = Pipeline(scaler=scaler, classifier=knnDistance)
        accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
        accuracy_table_columns.append(accuracies)
        precision_table_columns.append(precisions)
        recall_table_columns.append(recalls)
        print("   KNN model variation no:  " + str(progress) + "  processing is finished.\n" )
        progress += 1

        # without feature normalization
        print("  KNN model variation no:   " + str(progress) + "  is started being processed..." )
        pipeline = Pipeline(classifier=knnDistance)
        accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
        accuracy_table_columns.append(accuracies)
        precision_table_columns.append(precisions)
        recall_table_columns.append(recalls)
        print("   KNN model variation no:   " + str(progress) + "  processing is finished.\n" )
        progress += 1


    # model calculations are finished.
    finish = time.time()
    seconds = finish-start
    minutes = seconds//60
    seconds -= 60*minutes
    print("Results of 20 KNN model variations are ready in the sections below. Thank you for your patience.")
    print('Elapsed time is:   %d:%d   minutes:seconds' %(minutes,seconds))


run_all_models()
Results of 20 KNN model variations will be ready after ABOUT  25  MINUTES  of execution. Please wait... 

  KNN model variation no:   1  is started being processed...
   KNN model variation no:  1  processing is finished.

  KNN model variation no:   2  is started being processed...
   KNN model variation no:  2  processing is finished.

  KNN model variation no:   3  is started being processed...
   KNN model variation no:  3  processing is finished.

  KNN model variation no:   4  is started being processed...
   KNN model variation no:  4  processing is finished.

  KNN model variation no:   5  is started being processed...
   KNN model variation no:  5  processing is finished.

  KNN model variation no:   6  is started being processed...
   KNN model variation no:  6  processing is finished.

  KNN model variation no:   7  is started being processed...
   KNN model variation no:  7  processing is finished.

  KNN model variation no:   8  is started being processed...
   KNN model variation no:  8  processing is finished.

  KNN model variation no:   9  is started being processed...
   KNN model variation no:  9  processing is finished.

  KNN model variation no:   10  is started being processed...
   KNN model variation no:  10  processing is finished.

  KNN model variation no:   11  is started being processed...
   KNN model variation no:  11  processing is finished.

  KNN model variation no:   12  is started being processed...
   KNN model variation no:   12  processing is finished.

  KNN model variation no:   13  is started being processed...
   KNN model variation no:  13  processing is finished.

  KNN model variation no:   14  is started being processed...
   KNN model variation no:   14  processing is finished.

  KNN model variation no:   15  is started being processed...
   KNN model variation no:  15  processing is finished.

  KNN model variation no:   16  is started being processed...
   KNN model variation no:   16  processing is finished.

  KNN model variation no:   17  is started being processed...
   KNN model variation no:  17  processing is finished.

  KNN model variation no:   18  is started being processed...
   KNN model variation no:   18  processing is finished.

  KNN model variation no:   19  is started being processed...
   KNN model variation no:  19  processing is finished.

  KNN model variation no:   20  is started being processed...
   KNN model variation no:   20  processing is finished.

Results of 20 KNN model variations are ready in the sections below. Thank you for your patience.
Elapsed time is:   30:54   minutes:seconds
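
The 20 variations follow a regular pattern: two weighting modes, five values of k, and normalization on or off. As a sketch of how that grid and its table labels could be generated instead of written out by hand, the snippet below uses `itertools.product`; the function name `model_variations` is hypothetical, and the label format is an assumption chosen to mirror the column names used in the result tables.

```python
from itertools import product

neighbor_variations = [1, 3, 5, 7, 9]

def model_variations(ks):
    """Enumerate (label, weights, k, normalize) in the order the tables use."""
    variations = []
    # Outer loop: weighting mode, then k, then with/without normalization.
    for weights, k, normalize in product(["uniform", "distance"], ks, [True, False]):
        w_tag = "w+" if weights == "distance" else "w-"
        n_tag = "n+" if normalize else "n-"
        label = f"{len(variations) + 1}: k={k} {w_tag} {n_tag}"
        variations.append((label, weights, k, normalize))
    return variations

variations = model_variations(neighbor_variations)
column_labels = [label for label, _, _, _ in variations]

for label in column_labels:
    print(label)
```

Deriving both the experiment loop and the column labels from one list keeps them in the same order, so a result cannot end up under the wrong heading.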

Cross Validation Scores (Tables will be ready after about 25 minutes of execution)

Accuracy Results Table

def draw_accuracy_table():
    print("------ Accuracy - for 20 model variations ------")
    accuracy_rows = np.transpose(np.array(accuracy_table_columns))
    accuracy_table = pd.DataFrame(accuracy_rows, columns = ['1: k=1 w- n+','2: k=1 w- n-','3: k=3 w- n+','4: k=3 w- n-','5: k=5 w- n+','6: k=5 w- n-','7: k=7 w- n+','8: k=7 w- n-','9: k=9 w- n+','10: k=9 w- n-','11: k=1 w+ n+','12: k=1 w+ n-','13: k=3 w+ n+','14: k=3 w+ n-','15: k=5 w+ n+','16: k=5 w+ n-','17: k=7 w+ n+','18: k=7 w+ n-','19: k=9 w+ n+','20: k=9 w+ n-'])
    accuracy_table.index = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5', 'Average of Folds']

    display(accuracy_table.iloc[:, :10].head(6))
    display(accuracy_table.iloc[:, 10:].head(6))

    print("model variations encoding: \n k=  : k parameter of KNN \n w+  :       weighted KNN \n w-  :   non-weighted KNN \n n+  :    with feature normalization \n n-  : without feature normalization \n")

draw_accuracy_table()
------ Accuracy - for 20 model variations ------
|  | 1: k=1 w- n+ | 2: k=1 w- n- | 3: k=3 w- n+ | 4: k=3 w- n- | 5: k=5 w- n+ | 6: k=5 w- n- | 7: k=7 w- n+ | 8: k=7 w- n- | 9: k=9 w- n+ | 10: k=9 w- n- |
| Fold 1 | 0.9645 | 0.9750 | 0.9790 | 0.9835 | 0.9820 | 0.9870 | 0.9850 | 0.9910 | 0.9850 | 0.9900 |
| Fold 2 | 0.9635 | 0.9740 | 0.9750 | 0.9875 | 0.9865 | 0.9900 | 0.9850 | 0.9840 | 0.9845 | 0.9895 |
| Fold 3 | 0.9565 | 0.9760 | 0.9760 | 0.9890 | 0.9800 | 0.9900 | 0.9845 | 0.9915 | 0.9845 | 0.9840 |
| Fold 4 | 0.9640 | 0.9725 | 0.9790 | 0.9865 | 0.9840 | 0.9890 | 0.9860 | 0.9900 | 0.9860 | 0.9895 |
| Fold 5 | 0.9650 | 0.9770 | 0.9830 | 0.9875 | 0.9765 | 0.9835 | 0.9845 | 0.9850 | 0.9880 | 0.9865 |
| Average of Folds | 0.9627 | 0.9749 | 0.9784 | 0.9868 | 0.9818 | 0.9879 | 0.9850 | 0.9883 | 0.9856 | 0.9879 |
|  | 11: k=1 w+ n+ | 12: k=1 w+ n- | 13: k=3 w+ n+ | 14: k=3 w+ n- | 15: k=5 w+ n+ | 16: k=5 w+ n- | 17: k=7 w+ n+ | 18: k=7 w+ n- | 19: k=9 w+ n+ | 20: k=9 w+ n- |
| Fold 1 | 0.9790 | 0.9885 | 0.9845 | 0.9860 | 0.9840 | 0.9905 | 0.9870 | 0.9855 | 0.9875 | 0.9875 |
| Fold 2 | 0.9885 | 0.9890 | 0.9885 | 0.9875 | 0.9845 | 0.9855 | 0.9845 | 0.9860 | 0.9795 | 0.9875 |
| Fold 3 | 0.9860 | 0.9860 | 0.9855 | 0.9895 | 0.9850 | 0.9850 | 0.9835 | 0.9920 | 0.9840 | 0.9855 |
| Fold 4 | 0.9785 | 0.9900 | 0.9815 | 0.9890 | 0.9875 | 0.9900 | 0.9830 | 0.9880 | 0.9865 | 0.9890 |
| Fold 5 | 0.9850 | 0.9875 | 0.9830 | 0.9870 | 0.9820 | 0.9890 | 0.9835 | 0.9860 | 0.9815 | 0.9880 |
| Average of Folds | 0.9834 | 0.9882 | 0.9846 | 0.9878 | 0.9846 | 0.9880 | 0.9843 | 0.9875 | 0.9838 | 0.9875 |
model variations encoding: 
 k=  : k parameter of KNN 
 w+  :       weighted KNN 
 w-  :   non-weighted KNN 
 n+  :    with feature normalization 
 n-  : without feature normalization 

Previously calculated:

Precision Results Table

def draw_precision_table():
    print("------ Precision - for 20 model variations ------")
    precision_rows = np.transpose(np.array(precision_table_columns))
    precision_table = pd.DataFrame(precision_rows, columns = ['1: k=1 w- n+','2: k=1 w- n-','3: k=3 w- n+','4: k=3 w- n-','5: k=5 w- n+','6: k=5 w- n-','7: k=7 w- n+','8: k=7 w- n-','9: k=9 w- n+','10: k=9 w- n-','11: k=1 w+ n+','12: k=1 w+ n-','13: k=3 w+ n+','14: k=3 w+ n-','15: k=5 w+ n+','16: k=5 w+ n-','17: k=7 w+ n+','18: k=7 w+ n-','19: k=9 w+ n+','20: k=9 w+ n-'])
    precision_table.index = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5', 'Average of Folds']

    display(precision_table.iloc[:, :10].head(6))
    display(precision_table.iloc[:, 10:].head(6))

    print("model variations encoding: \n k=  : k parameter of KNN \n w+  :       weighted KNN \n w-  :   non-weighted KNN \n n+  :    with feature normalization \n n-  : without feature normalization \n")

draw_precision_table()
------ Precision - for 20 model variations ------

| | 1: k=1 w- n+ | 2: k=1 w- n- | 3: k=3 w- n+ | 4: k=3 w- n- | 5: k=5 w- n+ | 6: k=5 w- n- | 7: k=7 w- n+ | 8: k=7 w- n- | 9: k=9 w- n+ | 10: k=9 w- n- |
|---|---|---|---|---|---|---|---|---|---|---|
| Fold 1 | 0.964931 | 0.975168 | 0.979575 | 0.983745 | 0.982096 | 0.987054 | 0.984926 | 0.991138 | 0.985189 | 0.990488 |
| Fold 2 | 0.963493 | 0.974289 | 0.975090 | 0.987569 | 0.986388 | 0.990114 | 0.985068 | 0.984485 | 0.984691 | 0.989504 |
| Fold 3 | 0.956300 | 0.975652 | 0.975711 | 0.988692 | 0.980101 | 0.990182 | 0.984816 | 0.991562 | 0.984917 | 0.984084 |
| Fold 4 | 0.964323 | 0.972851 | 0.978742 | 0.986759 | 0.983876 | 0.988699 | 0.985986 | 0.990225 | 0.986165 | 0.989465 |
| Fold 5 | 0.965049 | 0.977203 | 0.983165 | 0.988015 | 0.976901 | 0.983499 | 0.984388 | 0.984856 | 0.987689 | 0.986730 |
| Average of Folds | 0.962819 | 0.975033 | 0.978457 | 0.986956 | 0.981872 | 0.987910 | 0.985037 | 0.988453 | 0.985730 | 0.988054 |


| | 11: k=1 w+ n+ | 12: k=1 w+ n- | 13: k=3 w+ n+ | 14: k=3 w+ n- | 15: k=5 w+ n+ | 16: k=5 w+ n- | 17: k=7 w+ n+ | 18: k=7 w+ n- | 19: k=9 w+ n+ | 20: k=9 w+ n- |
|---|---|---|---|---|---|---|---|---|---|---|
| Fold 1 | 0.979140 | 0.988587 | 0.984398 | 0.985933 | 0.983995 | 0.990601 | 0.987326 | 0.985271 | 0.987389 | 0.988169 |
| Fold 2 | 0.988541 | 0.988852 | 0.988575 | 0.987664 | 0.984403 | 0.985467 | 0.984774 | 0.986314 | 0.979564 | 0.987426 |
| Fold 3 | 0.985990 | 0.986343 | 0.985762 | 0.989589 | 0.984820 | 0.985038 | 0.983225 | 0.992135 | 0.984215 | 0.985262 |
| Fold 4 | 0.978856 | 0.989977 | 0.982044 | 0.989305 | 0.987463 | 0.989724 | 0.982838 | 0.987977 | 0.986498 | 0.988880 |
| Fold 5 | 0.985090 | 0.987503 | 0.983079 | 0.986972 | 0.982141 | 0.989050 | 0.983022 | 0.986275 | 0.981226 | 0.987906 |
| Average of Folds | 0.983523 | 0.988252 | 0.984772 | 0.987893 | 0.984564 | 0.987976 | 0.984237 | 0.987594 | 0.983779 | 0.987529 |

model variations encoding: 
 k=  : k parameter of KNN 
 w+  :       weighted KNN 
 w-  :   non-weighted KNN 
 n+  :    with feature normalization 
 n-  : without feature normalization 

Previously calculated:

Recall Results Table

def draw_recall_table():
    print("------ Recall - for 20 model variations ------")
    recall_rows = np.transpose(np.array(recall_table_columns))
    recall_table = pd.DataFrame(recall_rows, columns = ['1: k=1 w- n+','2: k=1 w- n-','3: k=3 w- n+','4: k=3 w- n-','5: k=5 w- n+','6: k=5 w- n-','7: k=7 w- n+','8: k=7 w- n-','9: k=9 w- n+','10: k=9 w- n-','11: k=1 w+ n+','12: k=1 w+ n-','13: k=3 w+ n+','14: k=3 w+ n-','15: k=5 w+ n+','16: k=5 w+ n-','17: k=7 w+ n+','18: k=7 w+ n-','19: k=9 w+ n+','20: k=9 w+ n-'])
    recall_table.index = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5', 'Average of Folds']

    display(recall_table.iloc[:, :10].head(6))
    display(recall_table.iloc[:, 10:].head(6))

    print("model variations encoding: \n k=  : k parameter of KNN \n w+  :       weighted KNN \n w-  :   non-weighted KNN \n n+  :    with feature normalization \n n-  : without feature normalization \n")

draw_recall_table()
------ Recall - for 20 model variations ------

| | 1: k=1 w- n+ | 2: k=1 w- n- | 3: k=3 w- n+ | 4: k=3 w- n- | 5: k=5 w- n+ | 6: k=5 w- n- | 7: k=7 w- n+ | 8: k=7 w- n- | 9: k=9 w- n+ | 10: k=9 w- n- |
|---|---|---|---|---|---|---|---|---|---|---|
| Fold 1 | 0.964879 | 0.974979 | 0.979374 | 0.983124 | 0.982098 | 0.987231 | 0.984686 | 0.991006 | 0.985051 | 0.990108 |
| Fold 2 | 0.963946 | 0.974364 | 0.975125 | 0.987568 | 0.986453 | 0.990096 | 0.985274 | 0.984587 | 0.984269 | 0.989460 |
| Fold 3 | 0.956482 | 0.975630 | 0.975891 | 0.989293 | 0.980243 | 0.989838 | 0.984498 | 0.991673 | 0.984258 | 0.984340 |
| Fold 4 | 0.964213 | 0.972436 | 0.978936 | 0.986719 | 0.984230 | 0.989184 | 0.986366 | 0.990025 | 0.986338 | 0.989577 |
| Fold 5 | 0.964989 | 0.976846 | 0.983303 | 0.987393 | 0.976480 | 0.983704 | 0.984770 | 0.984648 | 0.988402 | 0.986043 |
| Average of Folds | 0.962902 | 0.974851 | 0.978526 | 0.986819 | 0.981901 | 0.988010 | 0.985119 | 0.988388 | 0.985664 | 0.987905 |


| | 11: k=1 w+ n+ | 12: k=1 w+ n- | 13: k=3 w+ n+ | 14: k=3 w+ n- | 15: k=5 w+ n+ | 16: k=5 w+ n- | 17: k=7 w+ n+ | 18: k=7 w+ n- | 19: k=9 w+ n+ | 20: k=9 w+ n- |
|---|---|---|---|---|---|---|---|---|---|---|
| Fold 1 | 0.979304 | 0.988582 | 0.984669 | 0.986152 | 0.983794 | 0.990322 | 0.986863 | 0.985967 | 0.987335 | 0.987637 |
| Fold 2 | 0.988731 | 0.988752 | 0.988371 | 0.987625 | 0.984495 | 0.985386 | 0.984031 | 0.985918 | 0.979358 | 0.987495 |
| Fold 3 | 0.985669 | 0.985824 | 0.985575 | 0.990030 | 0.985215 | 0.985110 | 0.983652 | 0.992063 | 0.984191 | 0.985655 |
| Fold 4 | 0.978392 | 0.989751 | 0.981451 | 0.989091 | 0.987412 | 0.990144 | 0.982775 | 0.988243 | 0.986532 | 0.988647 |
| Fold 5 | 0.985144 | 0.987582 | 0.982972 | 0.986831 | 0.981797 | 0.988959 | 0.983493 | 0.986273 | 0.982077 | 0.988344 |
| Average of Folds | 0.983448 | 0.988098 | 0.984608 | 0.987946 | 0.984543 | 0.987984 | 0.984163 | 0.987693 | 0.983899 | 0.987556 |

model variations encoding: 
 k=  : k parameter of KNN 
 w+  :       weighted KNN 
 w-  :   non-weighted KNN 
 n+  :    with feature normalization 
 n-  : without feature normalization 

Previously calculated:

Error Analysis for classification


All 20 KNN variations performed well: in the fold averages, accuracy, precision, and recall are each above 96% for every variation.
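For readers who want to recompute the three metrics, the following is a minimal sketch of one common way to obtain them from predicted and true labels. It assumes macro averaging (an unweighted mean over classes) for precision and recall, which this report does not state explicitly; the function name `classification_metrics` and the toy labels are hypothetical.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Return (accuracy, macro precision, macro recall). Hypothetical helper."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)

    true_positives = np.diag(cm).astype(float)
    predicted_per_class = cm.sum(axis=0)  # column sums
    actual_per_class = cm.sum(axis=1)     # row sums

    # Classes that never occur get a score of 0 instead of a division by zero.
    precision = np.divide(true_positives, predicted_per_class,
                          out=np.zeros_like(true_positives),
                          where=predicted_per_class > 0)
    recall = np.divide(true_positives, actual_per_class,
                       out=np.zeros_like(true_positives),
                       where=actual_per_class > 0)

    accuracy = true_positives.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean()

# Toy example with 3 classes (illustrative data only).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
accuracy, macro_precision, macro_recall = classification_metrics(y_true, y_pred, 3)
```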

Overall, model no 8 performs best: it has the highest fold-average accuracy, precision, and recall at the same time, although its lead over the runner-up (model no 12) is only about 0.0001 to 0.0003. Model 8 parameters are: non-weighted KNN, k=7, no feature normalization.
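As a quick check, the sketch below takes the "Average of Folds" rows from the three result tables and finds the best column for each metric. The label construction mirrors the column names used in the tables; the variable names are our own choice for this illustration.

```python
import itertools
import numpy as np

# Column labels in table order: weighting, then k, then normalization.
labels = [
    f"{i}: k={k} {w} {n}"
    for i, (w, k, n) in enumerate(
        itertools.product(("w-", "w+"), (1, 3, 5, 7, 9), ("n+", "n-")), start=1
    )
]

# "Average of Folds" rows copied from the result tables (models 1 to 20).
averages = {
    "accuracy": np.array([
        0.9627, 0.9749, 0.9784, 0.9868, 0.9818, 0.9879, 0.9850, 0.9883, 0.9856, 0.9879,
        0.9834, 0.9882, 0.9846, 0.9878, 0.9846, 0.9880, 0.9843, 0.9875, 0.9838, 0.9875,
    ]),
    "precision": np.array([
        0.962819, 0.975033, 0.978457, 0.986956, 0.981872,
        0.987910, 0.985037, 0.988453, 0.985730, 0.988054,
        0.983523, 0.988252, 0.984772, 0.987893, 0.984564,
        0.987976, 0.984237, 0.987594, 0.983779, 0.987529,
    ]),
    "recall": np.array([
        0.962902, 0.974851, 0.978526, 0.986819, 0.981901,
        0.988010, 0.985119, 0.988388, 0.985664, 0.987905,
        0.983448, 0.988098, 0.984608, 0.987946, 0.984543,
        0.987984, 0.984163, 0.987693, 0.983899, 0.987556,
    ]),
}

# Best model variation per metric, chosen by the highest fold average.
best = {metric: labels[int(values.argmax())] for metric, values in averages.items()}
for metric, label in best.items():
    print(f"{metric}: best model is {label}")
```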


Important system parameters to consider for the classification:


1. k (number of neighbors) for the KNN algorithm

Our experiments show no clear relation between the parameter k and accuracy, precision, or recall. We tried k = 1, 3, 5, 7, 9 and obtained very close results (differences mostly below 0.005, i.e. 0.5 percentage points).

That said, the general trend, most visible for the non-weighted variants, is that higher values of k give higher accuracy, precision, and recall up to k=7. In our KNN implementation, a greater k adds no noticeable computational overhead: Euclidean distances are computed between every test sample and the whole training set regardless of k, so the running time is essentially the same.
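The point about computational cost can be illustrated with a minimal sketch, which is not our `KNNClassifier` itself. It assumes a brute-force approach: the full distance matrix is computed once, and every value of k only slices the same sorted neighbor list. The function names and the toy data are hypothetical.

```python
import numpy as np

def pairwise_euclidean(X_test, X_train):
    # Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b to avoid an explicit loop.
    squared = (
        (X_test ** 2).sum(axis=1)[:, None]
        + (X_train ** 2).sum(axis=1)[None, :]
        - 2.0 * X_test @ X_train.T
    )
    # Clip tiny negative values caused by floating point rounding.
    return np.sqrt(np.maximum(squared, 0.0))

def predict_for_many_k(X_train, y_train, X_test, k_values, n_classes):
    # The expensive part (distances and sorting) is done once, independent of k.
    distances = pairwise_euclidean(X_test, X_train)
    order = np.argsort(distances, axis=1)[:, :max(k_values)]
    neighbor_labels = y_train[order]

    predictions = {}
    for k in k_values:
        # Uniform majority vote among the first k neighbors of each test sample.
        votes = np.array([
            np.bincount(row, minlength=n_classes) for row in neighbor_labels[:, :k]
        ])
        predictions[k] = votes.argmax(axis=1)
    return predictions

# Toy data: two well separated clusters (illustrative only).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.2, 0.2], [5.5, 5.5]])
predictions = predict_for_many_k(X_train, y_train, X_test, (1, 3, 5), n_classes=2)
```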

2. Feature normalization

Our experiments show that MinMax scaling decreased prediction performance on this dataset. Without feature normalization we obtain slightly better accuracy, precision, and recall (by around 0.005, i.e. 0.5 percentage points) than with feature normalization.

As we have shown, the answers to some questions span a narrower range than [-3, 3]. Without feature normalization, those questions affect the KNN prediction less than questions whose answers span a wider range.

It can therefore be said that, in some sense, questions with a narrow answer range really are less informative than questions with a wide answer range. Keep in mind, however, that we experimented only with MinMaxScaler; other feature normalization methods might improve our model's prediction performance.
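The effect described above can be seen in a small sketch of min-max scaling. It assumes the scaling statistics are taken from the training split only; the helper `min_max_scale` and the two-question toy data are hypothetical. After scaling, a one-step difference on the narrow-range question weighs three times as much in the distance as a one-step difference on the wide-range question.

```python
import numpy as np

def min_max_scale(X_train, X_test):
    """Scale each feature to [0, 1] using the training minimum and range."""
    lowest = X_train.min(axis=0)
    span = X_train.max(axis=0) - lowest
    span = np.where(span == 0, 1.0, span)  # constant features stay unchanged
    return (X_train - lowest) / span, (X_test - lowest) / span, span

# Question 0 uses the full [-3, 3] range, question 1 only [-1, 1].
X_train = np.array([[-3.0, -1.0], [3.0, 1.0], [0.0, 0.0]])
X_test = np.array([[0.0, 1.0]])

scaled_train, scaled_test, span = min_max_scale(X_train, X_test)

# Size of a one-step answer difference after scaling, per question.
step_after_scaling = 1.0 / span
print(scaled_test)          # test sample in scaled coordinates
print(step_after_scaling)   # narrow-range question gets the larger step
```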



3. Weighted / Uniform (non-Weighted) KNN

Our experiments show no clear relation between the choice of weighted or uniform (non-weighted) KNN and prediction performance. One or the other is marginally better in individual comparisons (by around 0.005, i.e. 0.5 percentage points), but the differences are too small to establish a pattern or trend.

This is interesting. The explanation that comes to mind is that the samples are distributed almost homogeneously (uniformly), so the distances take very similar values and all neighbors of a test sample are roughly equally far from it.

That said, for lower values of k, weighted KNN performs around 0.01 (1 percentage point) better than uniform KNN in accuracy, precision, and recall alike. Our heuristic explanation is that with fewer neighbors the classification is more easily affected by noise. Weighted KNN is more noise-tolerant: noisy neighbors are somewhat muted because they lie farther away.

For the higher values of k (7 and 9), uniform KNN gives slightly better prediction results than weighted KNN (by up to about 0.002 in the fold averages).
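To make the difference between the two voting schemes concrete, here is a minimal sketch. It assumes inverse-distance weights (1 / distance), which is one common choice and not necessarily the exact weighting used in our `KNNClassifier`; the function `vote` and the toy neighbors are hypothetical.

```python
import numpy as np

def vote(neighbor_labels, neighbor_distances, n_classes, weighted):
    """Return the winning class among the neighbors of one test sample."""
    if weighted:
        # Closer neighbors count more; the small constant avoids division by zero.
        weights = 1.0 / (neighbor_distances + 1e-9)
    else:
        # Every neighbor counts the same.
        weights = np.ones_like(neighbor_distances, dtype=float)
    scores = np.bincount(neighbor_labels, weights=weights, minlength=n_classes)
    return int(scores.argmax())

# One close neighbor of class 0 and two distant neighbors of class 1.
labels = np.array([0, 1, 1])
distances = np.array([0.5, 2.0, 2.5])
uniform_prediction = vote(labels, distances, 2, weighted=False)
weighted_prediction = vote(labels, distances, 2, weighted=True)

# When all neighbors are equally far away, weighting changes nothing.
equal_distances = np.array([1.0, 1.0, 1.0])
equidistant_prediction = vote(labels, equal_distances, 2, weighted=True)
```

In this toy case the uniform vote follows the two distant neighbors, while the weighted vote follows the single close one. With equal distances both schemes agree, which is consistent with the explanation given above for the near-identical results.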



4. Distance metric that is used at KNN

Our experiments show no clear relation between the distance metric (Euclidean, Manhattan, etc.) and prediction success. We left this variation out of our tables to avoid overcrowding them.
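For reference, the two metrics named above can be sketched with NumPy broadcasting as follows. The function `pairwise_distance` and the toy points are hypothetical; they only illustrate that the nearest neighbor can, in principle, change with the metric, even though we saw no clear effect on this dataset.

```python
import numpy as np

def pairwise_distance(A, B, metric="euclidean"):
    """Distances between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]  # shape: (len(A), len(B), n_features)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unknown metric: {metric}")

query = np.array([[0.0, 0.0]])
points = np.array([[3.0, 0.0], [2.0, 2.0]])

euclidean = pairwise_distance(query, points, "euclidean")
manhattan = pairwise_distance(query, points, "manhattan")

# Index of the nearest point under each metric.
nearest_euclidean = int(euclidean.argmin(axis=1)[0])
nearest_manhattan = int(manhattan.argmin(axis=1)[0])
```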

5. Number of folds at cross validation

k-fold cross validation is a well-known way to estimate a model's real-world performance more accurately. With more folds, each model is trained on a larger portion of the data, while every sample is still used for testing exactly once. However, this comes at the cost of computation time, which grows in direct proportion to the number of folds.

Comparing individual folds with each other is not meaningful, since the dataset is randomly shuffled before the split. The average over the folds is the decision metric for comparing the performance of different model variations.
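The procedure can be sketched as follows. This is a minimal illustration under our own assumptions, not the `KFold` class used in this notebook: the names `k_fold_indices` and `cross_validate` are hypothetical, and `evaluate` stands for any callable that trains on the given training indices and returns a score on the test indices.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs after one random shuffle."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

def cross_validate(evaluate, n_samples, n_folds, seed=0):
    """Return the per-fold scores and their average (the decision metric)."""
    scores = [evaluate(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, n_folds, seed)]
    return scores, float(np.mean(scores))

# Dummy evaluation that just reports the share of samples held out per fold.
n_samples, n_folds = 100, 5
scores, mean_score = cross_validate(
    lambda train_idx, test_idx: len(test_idx) / n_samples, n_samples, n_folds
)
splits = list(k_fold_indices(n_samples, n_folds))
```

The model is fitted once per fold, which is why the running time grows with the number of folds.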



Our heuristic favorite parameters

As team "Epoche", our go-to parameters for this dataset are: k=7, uniform (non-weighted) KNN (same classification performance in less computation time), Euclidean distance as the distance metric, and no feature scaling.

Comments on Misclassifications


Here, we will show a few misclassified samples, and ask their neighbors why they were misclassified.

print(" \n3 misclassified samples and their neighbors will be ready in about  1  MINUTE  of execution. Please wait... \n")

cv = KFold(5,shuffle=True, random_state=24)
X_train, X_test, y_train, y_test = next(cv.split(X,Y))
knnUniform = KNNClassifier(n_neighbors=7, weights='uniform_neighbors', n_classes=numberOfClasses)
knnUniform.fit(X_train, y_train)

misclassifiedNum = 1
predictions, neighbors = knnUniform.predict(X_test)

print("3 misclassified examples in  model no 8:  k=7, non-weighted KNN, no feature normalization:\n ")

# Walk through the test fold and report the first 3 misclassified samples.
for i in range(len(y_test)):
    if predictions[i] != y_test[i]:
        print("misclassified sample " + str(misclassifiedNum) +" :")
        print("  Predicted label: " + str(predictions[i]) + "(" + personality_types[predictions[i]] + ")")
        print("  Actual label:    " + str(y_test[i]) + "(" + personality_types[y_test[i]] + ")")
        print("  nearest Neighbours:  ", end =" ")
        # neighbors[i] holds training-set indices, so look up each neighbor's label in y_train
        for neighbour in neighbors[i]:
            print(str(y_train[neighbour]) + "(" + personality_types[y_train[neighbour]] + ")", end =" ")
        print("\n")
        misclassifiedNum += 1

    # Stop once 3 misclassified samples have been printed.
    if misclassifiedNum > 3:
        break
3 misclassified samples and their neighbors will be ready in about  1  MINUTE  of execution. Please wait... 

3 misclassified examples in  model no 8:  k=7, non-weighted KNN, no feature normalization:
 
misclassified sample 1 :
  Predicted label: 11(ENFP)
  Actual label:    8(ESTP)
  nearest Neighbours:   11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP) 

misclassified sample 2 :
  Predicted label: 9(ESFP)
  Actual label:    2(ESFJ)
  nearest Neighbours:   9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP) 

misclassified sample 3 :
  Predicted label: 4(ISTJ)
  Actual label:    14(INTP)
  nearest Neighbours:   4(ISTJ) 14(INTP) 4(ISTJ) 14(INTP) 4(ISTJ) 4(ISTJ) 14(INTP) 

Another previously calculated example:

<img src="