The dataset consists of 10k answers for 60 questions from the 16 Personality Test and their ground truth labels(Personality Types).
Answers to the questions are stored in following manner:
Fully Agree: 3
Partially Agree: 2
Slightly Agree: 1
Neutral: 0
Slightly disagree: -1
Partially disagree: -2
Fully disagree: -3
import pandas as pd
pd.set_option('display.precision', 6)
import numpy as np
df = pd.read_csv("subset_16P.csv", encoding='cp1252')
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Response Id | You regularly make new friends. | You spend a lot of your free time exploring various random topics that pique your interest | Seeing other people cry can easily make you feel like you want to cry too | You often make a backup plan for a backup plan. | You usually stay calm, even under a lot of pressure | At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know | You prefer to completely finish one project before starting another. | You are very sentimental. | You like to use organizing tools like schedules and lists. | ... | You believe that pondering abstract philosophical questions is a waste of time. | You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places. | You know at first glance how someone is feeling. | You often feel overwhelmed. | You complete things methodically without skipping over any steps. | You are very intrigued by things labeled as controversial. | You would pass along a good opportunity if you thought someone else needed it more. | You struggle with deadlines. | You feel confident that things will work out for you. | Personality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 35874 | -1 | 0 | -1 | 1 | -1 | -2 | -2 | 0 | -1 | ... | 0 | 3 | 0 | 0 | 0 | 0 | 1 | -1 | 0 | ENTP |
1 | 42624 | 0 | 0 | 1 | 0 | 0 | 0 | -1 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | -1 | -3 | 2 | INTP |
2 | 55199 | 0 | 0 | -2 | -1 | 2 | -2 | 0 | 0 | -1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | ESTP |
3 | 52983 | 0 | 0 | 0 | 1 | -2 | -1 | 0 | 0 | 1 | ... | 1 | 1 | 0 | -1 | 0 | -1 | 2 | -2 | 0 | ENTP |
4 | 22864 | 0 | 0 | 2 | 1 | 0 | -2 | -1 | 0 | 1 | ... | 1 | -2 | 0 | 1 | 0 | 0 | 0 | -2 | 2 | ENFJ |
5 rows × 62 columns
In order to use string type ground truth labels in our ML algorithms more effectively, we preferred to convert them to integers ranged from 0 to 15.
numberOfClasses = len(df["Personality"].unique())
personality_types =[ "ESTJ", "ENTJ", "ESFJ", "ENFJ", "ISTJ", "ISFJ",
"INTJ", "INFJ", "ESTP", "ESFP", "ENTP", "ENFP",
"ISTP", "ISFP", "INTP", "INFP" ]
df.Personality = df.Personality.astype("category", personality_types).cat.codes
df.Personality.describe()
count 10000.000000
mean 7.500200
std 4.621811
min 0.000000
25% 3.000000
50% 8.000000
75% 12.000000
max 15.000000
Name: Personality, dtype: float64
df.Personality.value_counts()
9 662
3 641
15 641
7 640
0 632
1 629
14 625
8 625
11 624
12 624
2 622
4 620
13 616
5 615
10 596
6 588
Name: Personality, dtype: int64
Later, we splitted the dataset to X(question answers) and y(ground truth Personality labels).
X = df.drop(['Response Id','Personality'], axis=1)
y = df.Personality
X= X.to_numpy()
Y= y.to_numpy()
Some machine learning algorithms are sensitive to feature scaling while others are virtually invariant to it. Since we are using KNN classifier in this project, if we don't use any feature normalization methods, some features of the dataset can influence the prediction more than other features and this generally isn't a thing we want.
df.describe()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
Response Id | You regularly make new friends. | You spend a lot of your free time exploring various random topics that pique your interest | Seeing other people cry can easily make you feel like you want to cry too | You often make a backup plan for a backup plan. | You usually stay calm, even under a lot of pressure | At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know | You prefer to completely finish one project before starting another. | You are very sentimental. | You like to use organizing tools like schedules and lists. | ... | You believe that pondering abstract philosophical questions is a waste of time. | You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places. | You know at first glance how someone is feeling. | You often feel overwhelmed. | You complete things methodically without skipping over any steps. | You are very intrigued by things labeled as controversial. | You would pass along a good opportunity if you thought someone else needed it more. | You struggle with deadlines. | You feel confident that things will work out for you. | Personality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 10000.000000 | 10000.00000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.00000 | 10000.000000 | 10000.000000 | ... | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.0000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 30033.526600 | -0.00420 | 0.002100 | 0.01470 | -0.211000 | -0.14970 | 0.012500 | -0.45950 | 0.002400 | 0.130300 | ... | 0.000700 | 0.123400 | -0.002900 | 0.258500 | -0.004600 | -0.002400 | 0.1192 | -0.027200 | 0.100300 | 7.500200 |
std | 17310.103985 | 0.37013 | 0.370013 | 1.53796 | 1.523388 | 1.49416 | 1.514983 | 1.45278 | 0.362777 | 1.535629 | ... | 0.364572 | 1.528073 | 0.371087 | 1.495494 | 0.363857 | 0.368792 | 1.5250 | 1.531305 | 1.561885 | 4.621811 |
min | 0.000000 | -1.00000 | -1.000000 | -3.00000 | -3.000000 | -3.00000 | -3.000000 | -3.00000 | -1.000000 | -3.000000 | ... | -1.000000 | -3.000000 | -1.000000 | -3.000000 | -1.000000 | -1.000000 | -3.0000 | -3.000000 | -3.000000 | 0.000000 |
25% | 15058.750000 | 0.00000 | 0.000000 | -1.00000 | -1.000000 | -1.00000 | -1.000000 | -2.00000 | 0.000000 | -1.000000 | ... | 0.000000 | -1.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | -1.0000 | -1.000000 | -1.000000 | 3.000000 |
50% | 29961.500000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | -1.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 8.000000 |
75% | 45206.750000 | 0.00000 | 0.000000 | 1.00000 | 1.000000 | 1.00000 | 1.000000 | 0.00000 | 0.000000 | 1.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.0000 | 1.000000 | 1.000000 | 12.000000 |
max | 59997.000000 | 1.00000 | 1.000000 | 3.00000 | 3.000000 | 3.00000 | 3.000000 | 3.00000 | 1.000000 | 3.000000 | ... | 1.000000 | 3.000000 | 1.000000 | 3.000000 | 1.000000 | 1.000000 | 3.0000 | 3.000000 | 3.000000 | 15.000000 |
8 rows × 62 columns
As can seen from the table, the distribution of answers for each column is different. Even though possible answer scores range from -3 to 3, answers for some questions are ranging between different numbers. Thus, when we don't use feature normalization, some columns are going to influence outcome more.
In this project, we will be using MinMaxScaler for our feature normalization algorithm.
MinMaxScaler shrinks the data within the given range, usually 0 to 1. In this project we will shrink each column to 0 to 1 range with the formula given below.
Another important point to mention is that when scaling your train and test datasets, you need to avoid information leakage onto the test dataset. So if you scale your test dataset with the min max values from test dataset itself, you leak information of min max values of the whole test dataset, through your model and it's a bad practice. Thus you must use the min max values from the training dataset while scaling.
class MinMaxScaler():
def __init__(self):
self.mins = []
self.maxes = []
def fit_transform(self, X):
self.mins = X.min(axis=0)
self.maxes = X.max(axis=0)
maxMinusMin = self.maxes - self.mins
return (X - self.mins) / maxMinusMin
def transform(self, X):
maxMinusMin = self.maxes - self.mins
return (X - self.mins) / maxMinusMin
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
The general procedure is as follows:
- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
- Take the group as a hold out or test data set
- Take the remaining groups as a training data set
- Fit a model on the training set and evaluate it on the test set
- Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
[source] (https://machinelearningmastery.com/k-fold-cross-validation/)
import random
class KFold():
def __init__(self,n_splits=5, shuffle=True, random_state=42):
self.shuffle = shuffle
self.n_splits=n_splits
self.random_state= random_state
# Fisher-Yates Shuffle Algorithm
def shuffler (self, arr, n):
random.seed(n)
rowSize = arr.shape[0]
for i in range(rowSize-1,0,-1):
# random index from 0 to i
j = random.randint(0,i+1)
# Swap with random index
arr[[i, j]] = arr[[j, i]]
return arr
def split(self, X, y):
if(self.shuffle):
X = self.shuffler(X, self.random_state)
y = self.shuffler(y, self.random_state)
rowSize = len(X)
testSetSize = rowSize // self.n_splits
for i in range(self.n_splits):
if(i==0):
x_train = X[(i+1)*testSetSize :,]
y_train = Y[(i+1)*testSetSize :,]
elif(i==self.n_splits-1):
x_train = X[:i*testSetSize,]
y_train = Y[:i*testSetSize,]
else:
# [ row1,row2, ..., x_train_rows, rowk, ...]
# appending rows prior to x_train with rows comes after x_train
x_train_smaller_indices = X[:i*testSetSize,]
y_train_smaller_indices = Y[:i*testSetSize,]
x_train = np.append(
x_train_smaller_indices, X[(i+1)*testSetSize :,], axis = 0
)
y_train = np.append(
y_train_smaller_indices, Y[(i+1)*testSetSize :,], axis = 0
)
if(i!=self.n_splits-1):
x_test = X[i*testSetSize : (i+1)*testSetSize,]
y_test = Y[i*testSetSize : (i+1)*testSetSize,]
else:
# because we calculate testSetSize with //,
# last split must finish through the end of the whole array
x_test = X[i*testSetSize :,]
y_test = Y[i*testSetSize :,]
yield (x_train, x_test, y_train, y_test)
KNN is an instance-based learning method. Instance-based learning (memory-based learning, lazy learning) is a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory.
There are efficient implementations to store the data using complex data structures like k-d trees to make look-up and matching of new patterns during prediction more efficient. But in this project we will be making use of basic numpy arrays.
Predictions are made for a new instance (x) by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable, in classification this might be the mod (or most common) class value.
To determine which of the K instances in the training dataset are most similar to a new input, a distance measure is used. The most popular distance measures are Euclidean Distance, Manhattan Distance, Minkowski Distance, Hamming Distance. In this project we will be using Euclidean distance.
class KNNClassifier():
def __init__(self, n_neighbors=5, weights='uniform', n_classes = 16):
self.X_train = None
self.y_train = None
self.n_classes = n_classes
self.n_neighbors = n_neighbors
self.weights = weights
def fit(self, X_train, y_train):
self.X_train = X_train
self.y_train = y_train
def euclidian_distance(self, a, b):
distances = np.sqrt(np.sum((a - b)**2, axis=1))
# prevent division by zero
distances[np.where(distances < 0.00001)] = 0.00001
return distances
def kneighbors(self, X_test, return_distance=False):
dist = []
neigh_ind = []
point_dist = [self.euclidian_distance(x_test, self.X_train) for x_test in X_test]
for row in point_dist:
enum_neigh = enumerate(row)
sorted_neigh = sorted(enum_neigh,
key=lambda x: x[1])[:self.n_neighbors]
ind_list = [tup[0] for tup in sorted_neigh]
dist_list = [tup[1] for tup in sorted_neigh]
dist.append(dist_list)
neigh_ind.append(ind_list)
if return_distance:
return np.array(dist), np.array(neigh_ind)
return np.array(neigh_ind)
def predict(self, X_test):
# non-weighted knn, majority voting of neighbors for classification
if self.weights == 'uniform':
neighbors = self.kneighbors(X_test)
y_pred = np.array([
np.argmax(np.bincount(self.y_train[neighbor]))
for neighbor in neighbors
])
return y_pred
# weighted knn, voting based on weights of neighbors
elif self.weights == 'distance':
dist, neigh_ind = self.kneighbors(X_test, return_distance=True)
inv_dist = 1 / dist
mean_inv_dist = inv_dist / np.sum(inv_dist, axis=1)[:, np.newaxis]
proba = []
for i, row in enumerate(mean_inv_dist):
row_pred = self.y_train[neigh_ind[i]]
for k in range(self.n_classes):
indices = np.where(row_pred == k)
prob_ind = np.sum(row[indices])
proba.append(np.array(prob_ind))
predict_proba = np.array(proba).reshape(X_test.shape[0],
self.n_classes)
y_pred = np.array([np.argmax(item) for item in predict_proba])
return y_pred
# used for interpretation of misclassified samples, return also nearest neighbors
elif self.weights == 'uniform_neighbors':
neighbors = self.kneighbors(X_test) # nearestNeighborsIndices_of_all_testSamples
y_pred = np.array([
np.argmax(np.bincount(self.y_train[neighbor]))
for neighbor in neighbors
])
return y_pred, neighbors
A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.
It allows the sequence of steps to be specified, evaluated, and used as an atomic unit. Like:
- [Input], [Normalization], [KNN Classifier], [Predictions]
- [Input], [Standardization], [RFE], [SVM], [Predictions]
# from scipy import stats
class Pipeline():
def __init__(self, scaler=None, classifier=None):
self.scaler = scaler
self.classifier = classifier
def execute(self,x_train, x_test, y_train):
if(self.scaler is not None):
x_train = self.scaler.fit_transform(x_train)
x_test = self.scaler.transform(x_test)
if(self.classifier is not None):
self.classifier.fit(x_train, y_train)
return self.classifier.predict(x_test)
A classifier is only as important as the metric used to evaluate it.
If you choose the wrong metric to evaluate your models, you are likely to choose a poor model, or in the worst case, be misled about the expected performance of your model.
And choosing the right classification metric is particularly difficult for imbalanced classification problems. Firstly, because most of the standard metrics that are widely used assume a balanced class distribution, and because typically not all classes, and therefore, not all prediction errors, are equal for imbalanced classification.
In this project we will be using Accuracy, Precision and Recall metrics to evaluate our ML models' predictions.
def accuracy(pred, actual):
return sum(pred == actual) / len(pred)
Since there are 16 ground truth labels, we take the average precision of all labels.
def precision(pred, actual):
if(len(pred) == 0 or len(pred) != len(actual)):
return -1
labels= []
truePositivesPerLabel = {}
falsePositivesPerLabel = {}
precisionPerLabel = {}
for i in range(len(pred)):
prediction = pred[i]
if prediction not in labels:
labels.append(prediction)
truePositivesPerLabel[prediction] = 0
falsePositivesPerLabel[prediction] = 0
if(pred[i] == actual[i]):
truePositivesPerLabel[prediction] +=1
else:
falsePositivesPerLabel[prediction] +=1
# count of the labels that are existed inside the ground truth or prediction
existedLabelCount = 0
precisionSum = 0
for label in labels:
denominator = truePositivesPerLabel[label] + falsePositivesPerLabel[label]
if(denominator >=0):
existedLabelCount +=1
precisionSum += truePositivesPerLabel[label] / denominator
return precisionSum / existedLabelCount
Since there are 16 ground truth labels, we take the average recall precision of all labels.
def recall(pred, actual):
if(len(pred) == 0 or len(pred) != len(actual)):
return -1
labels= []
truePositivesPerLabel = {}
falseNegativesPerLabel = {}
recallPerLabel = {}
for i in range(len(actual)):
actualClass = actual[i]
if actualClass not in labels:
labels.append(actualClass)
truePositivesPerLabel[actualClass] = 0
falseNegativesPerLabel[actualClass] = 0
if(pred[i] == actual[i]):
truePositivesPerLabel[actualClass] +=1
else:
falseNegativesPerLabel[actualClass] +=1
# count of the labels that are existed inside the ground truth or prediction
existedLabelCount = 0
recallSum = 0
for label in labels:
denominator = truePositivesPerLabel[label] + falseNegativesPerLabel[label]
if(denominator >=0):
existedLabelCount +=1
recallSum += truePositivesPerLabel[label] / denominator
return recallSum / existedLabelCount
When evaluating different settings (“hyperparameters”) for estimators, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation. A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller setss. The following procedure is followed for each of the k “folds”:
- A model is trained using of the folds as training data;
- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
def cross_val_score(X, Y, cv, pipeline):
accuracy_folds = []
precision_folds = []
recall_folds = []
# for each Fold of 5-fold-validation
for (x_train, x_test, y_train, y_test) in cv.split(X,Y):
y_pred = pipeline.execute(x_train, x_test, y_train)
accuracy_folds.append(accuracy(y_pred, y_test))
precision_folds.append(recall(y_pred, y_test))
recall_folds.append(precision(y_pred, y_test))
# averages of folds
accuracy_folds.append(sum(accuracy_folds)/5)
precision_folds.append(sum(precision_folds)/5)
recall_folds.append(sum(recall_folds)/5)
return accuracy_folds, precision_folds, recall_folds
Now we will be comparing our models' performance with/without feature normalization and with different k_neighbors values as KNNClassifier parameter.
cv = KFold(5, shuffle=True, random_state=24)
scaler = MinMaxScaler()
neighborVariations = [1,3,5,7,9]
accuracy_table_columns = []
precision_table_columns = []
recall_table_columns = []
import time
def run_all_models():
print(" \nResults of 20 KNN model variations will be ready after ABOUT 25 MINUTES of execution. Please wait... \n")
progress = 1
start = time.time()
""" *** NON-WEIGHTED KNN *** """
for k in neighborVariations: # THIS LOOP TAKES ABOUT 10 MINUTES TO COMPLETE
knnUniform = KNNClassifier(n_neighbors=k, weights='uniform', n_classes=numberOfClasses)
# with feature normalization
print(" KNN model variation no: " + str(progress) + " is started being processed..." )
pipeline = Pipeline(scaler=scaler, classifier=knnUniform)
accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
accuracy_table_columns.append(accuracies)
precision_table_columns.append(precisions)
recall_table_columns.append(recalls)
print(" KNN model variation no: " + str(progress) + " processing is finished.\n" )
progress += 1
# without feature normalization
print(" KNN model variation no: " + str(progress) + " is started being processed..." )
pipeline = Pipeline(classifier=knnUniform)
accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
accuracy_table_columns.append(accuracies)
precision_table_columns.append(precisions)
recall_table_columns.append(recalls)
print(" KNN model variation no: " + str(progress) + " processing is finished.\n" )
progress += 1
""" *** WEIGHTED KNN *** """
for k in neighborVariations: # THIS LOOP TAKES ABOUT 15 MINUTES TO COMPLETE
knnDistance = KNNClassifier(n_neighbors=5, weights='distance', n_classes=numberOfClasses)
# with feature normalization
print(" KNN model variation no: " + str(progress) + " is started being processed..." )
pipeline = Pipeline(scaler=scaler, classifier=knnDistance)
accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
accuracy_table_columns.append(accuracies)
precision_table_columns.append(precisions)
recall_table_columns.append(recalls)
print(" KNN model variation no: " + str(progress) + " processing is finished.\n" )
progress += 1
# without feature normalization
print(" KNN model variation no: " + str(progress) + " is started being processed..." )
pipeline = Pipeline(classifier=knnDistance)
accuracies, precisions, recalls = cross_val_score(X, Y, cv, pipeline)
accuracy_table_columns.append(accuracies)
precision_table_columns.append(precisions)
recall_table_columns.append(recalls)
print(" KNN model variation no: " + str(progress) + " processing is finished.\n" )
progress += 1
# model calculations are finished.
finish = time.time()
seconds = finish-start
minutes = seconds//60
seconds -= 60*minutes
print("Results of 20 KNN model variations are ready in the sections below. Thank you for your patience.")
print('Elapsed time is: %d:%d minutes:seconds' %(minutes,seconds))
run_all_models()
Results of 20 KNN model variations will be ready after ABOUT 25 MINUTES of execution. Please wait...
KNN model variation no: 1 is started being processed...
KNN model variation no: 1 processing is finished.
KNN model variation no: 2 is started being processed...
KNN model variation no: 2 processing is finished.
KNN model variation no: 3 is started being processed...
KNN model variation no: 3 processing is finished.
KNN model variation no: 4 is started being processed...
KNN model variation no: 4 processing is finished.
KNN model variation no: 5 is started being processed...
KNN model variation no: 5 processing is finished.
KNN model variation no: 6 is started being processed...
KNN model variation no: 6 processing is finished.
KNN model variation no: 7 is started being processed...
KNN model variation no: 7 processing is finished.
KNN model variation no: 8 is started being processed...
KNN model variation no: 8 processing is finished.
KNN model variation no: 9 is started being processed...
KNN model variation no: 9 processing is finished.
KNN model variation no: 10 is started being processed...
KNN model variation no: 10 processing is finished.
KNN model variation no: 11 is started being processed...
KNN model variation no: 11 processing is finished.
KNN model variation no: 12 is started being processed...
KNN model variation no: 12 processing is finished.
KNN model variation no: 13 is started being processed...
KNN model variation no: 13 processing is finished.
KNN model variation no: 14 is started being processed...
KNN model variation no: 14 processing is finished.
KNN model variation no: 15 is started being processed...
KNN model variation no: 15 processing is finished.
KNN model variation no: 16 is started being processed...
KNN model variation no: 16 processing is finished.
KNN model variation no: 17 is started being processed...
KNN model variation no: 17 processing is finished.
KNN model variation no: 18 is started being processed...
KNN model variation no: 18 processing is finished.
KNN model variation no: 19 is started being processed...
KNN model variation no: 19 processing is finished.
KNN model variation no: 20 is started being processed...
KNN model variation no: 20 processing is finished.
Results of 20 KNN model variations are ready in the sections below. Thank you for your patience.
Elapsed time is: 30:54 minutes:seconds
def draw_accuracy_table():
print("------ Accuracy - for 20 model variations ------")
accuracy_rows = np.transpose(np.array(accuracy_table_columns))
accuracy_table = pd.DataFrame(accuracy_rows, columns = ['1: k=1 w- n+','2: k=1 w- n-','3: k=3 w- n+','4: k=3 w- n-','5: k=5 w- n+','6: k=5 w- n-','7: k=7 w- n+','8: k=7 w- n-','9: k=9 w- n+','10: k=9 w- n-','11: k=1 w+ n+','12: k=1 w+ n-','13: k=3 w+ n+','14: k=3 w+ n-','15: k=5 w+ n+','16: k=5 w+ n-','17: k=7 w+ n+','18: k=7 w+ n-','19: k=9 w+ n+','20: k=9 w+ n-'])
accuracy_table.index = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5', 'Average of Folds']
display(accuracy_table.iloc[:, :10].head(6))
display(accuracy_table.iloc[:, 10:].head(6))
print("model variations encoding: \n k= : k parameter of KNN \n w+ : weighted KNN \n w- : non-weighted KNN \n n+ : with feature normalization \n n- : without feature normalization \n")
draw_accuracy_table()
------ Accuracy - for 20 model variations ------
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
1: k=1 w- n+ | 2: k=1 w- n- | 3: k=3 w- n+ | 4: k=3 w- n- | 5: k=5 w- n+ | 6: k=5 w- n- | 7: k=7 w- n+ | 8: k=7 w- n- | 9: k=9 w- n+ | 10: k=9 w- n- | |
---|---|---|---|---|---|---|---|---|---|---|
Fold 1 | 0.9645 | 0.9750 | 0.9790 | 0.9835 | 0.9820 | 0.9870 | 0.9850 | 0.9910 | 0.9850 | 0.9900 |
Fold 2 | 0.9635 | 0.9740 | 0.9750 | 0.9875 | 0.9865 | 0.9900 | 0.9850 | 0.9840 | 0.9845 | 0.9895 |
Fold 3 | 0.9565 | 0.9760 | 0.9760 | 0.9890 | 0.9800 | 0.9900 | 0.9845 | 0.9915 | 0.9845 | 0.9840 |
Fold 4 | 0.9640 | 0.9725 | 0.9790 | 0.9865 | 0.9840 | 0.9890 | 0.9860 | 0.9900 | 0.9860 | 0.9895 |
Fold 5 | 0.9650 | 0.9770 | 0.9830 | 0.9875 | 0.9765 | 0.9835 | 0.9845 | 0.9850 | 0.9880 | 0.9865 |
Average of Folds | 0.9627 | 0.9749 | 0.9784 | 0.9868 | 0.9818 | 0.9879 | 0.9850 | 0.9883 | 0.9856 | 0.9879 |
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
11: k=1 w+ n+ | 12: k=1 w+ n- | 13: k=3 w+ n+ | 14: k=3 w+ n- | 15: k=5 w+ n+ | 16: k=5 w+ n- | 17: k=7 w+ n+ | 18: k=7 w+ n- | 19: k=9 w+ n+ | 20: k=9 w+ n- | |
---|---|---|---|---|---|---|---|---|---|---|
Fold 1 | 0.9790 | 0.9885 | 0.9845 | 0.9860 | 0.9840 | 0.9905 | 0.9870 | 0.9855 | 0.9875 | 0.9875 |
Fold 2 | 0.9885 | 0.9890 | 0.9885 | 0.9875 | 0.9845 | 0.9855 | 0.9845 | 0.9860 | 0.9795 | 0.9875 |
Fold 3 | 0.9860 | 0.9860 | 0.9855 | 0.9895 | 0.9850 | 0.9850 | 0.9835 | 0.9920 | 0.9840 | 0.9855 |
Fold 4 | 0.9785 | 0.9900 | 0.9815 | 0.9890 | 0.9875 | 0.9900 | 0.9830 | 0.9880 | 0.9865 | 0.9890 |
Fold 5 | 0.9850 | 0.9875 | 0.9830 | 0.9870 | 0.9820 | 0.9890 | 0.9835 | 0.9860 | 0.9815 | 0.9880 |
Average of Folds | 0.9834 | 0.9882 | 0.9846 | 0.9878 | 0.9846 | 0.9880 | 0.9843 | 0.9875 | 0.9838 | 0.9875 |
model variations encoding:
k= : k parameter of KNN
w+ : weighted KNN
w- : non-weighted KNN
n+ : with feature normalization
n- : without feature normalization
def draw_precision_table():
print("------ Precision - for 20 model variations ------")
precision_rows = np.transpose(np.array(precision_table_columns))
precision_table = pd.DataFrame(precision_rows, columns = ['1: k=1 w- n+','2: k=1 w- n-','3: k=3 w- n+','4: k=3 w- n-','5: k=5 w- n+','6: k=5 w- n-','7: k=7 w- n+','8: k=7 w- n-','9: k=9 w- n+','10: k=9 w- n-','11: k=1 w+ n+','12: k=1 w+ n-','13: k=3 w+ n+','14: k=3 w+ n-','15: k=5 w+ n+','16: k=5 w+ n-','17: k=7 w+ n+','18: k=7 w+ n-','19: k=9 w+ n+','20: k=9 w+ n-'])
precision_table.index = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5', 'Average of Folds']
display(precision_table.iloc[:, :10].head(6))
display(precision_table.iloc[:, 10:].head(6))
print("model variations encoding: \n k= : k parameter of KNN \n w+ : weighted KNN \n w- : non-weighted KNN \n n+ : with feature normalization \n n- : without feature normalization \n")
draw_precision_table()
------ Precision - for 20 model variations ------
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
1: k=1 w- n+ | 2: k=1 w- n- | 3: k=3 w- n+ | 4: k=3 w- n- | 5: k=5 w- n+ | 6: k=5 w- n- | 7: k=7 w- n+ | 8: k=7 w- n- | 9: k=9 w- n+ | 10: k=9 w- n- | |
---|---|---|---|---|---|---|---|---|---|---|
Fold 1 | 0.964931 | 0.975168 | 0.979575 | 0.983745 | 0.982096 | 0.987054 | 0.984926 | 0.991138 | 0.985189 | 0.990488 |
Fold 2 | 0.963493 | 0.974289 | 0.975090 | 0.987569 | 0.986388 | 0.990114 | 0.985068 | 0.984485 | 0.984691 | 0.989504 |
Fold 3 | 0.956300 | 0.975652 | 0.975711 | 0.988692 | 0.980101 | 0.990182 | 0.984816 | 0.991562 | 0.984917 | 0.984084 |
Fold 4 | 0.964323 | 0.972851 | 0.978742 | 0.986759 | 0.983876 | 0.988699 | 0.985986 | 0.990225 | 0.986165 | 0.989465 |
Fold 5 | 0.965049 | 0.977203 | 0.983165 | 0.988015 | 0.976901 | 0.983499 | 0.984388 | 0.984856 | 0.987689 | 0.986730 |
Average of Folds | 0.962819 | 0.975033 | 0.978457 | 0.986956 | 0.981872 | 0.987910 | 0.985037 | 0.988453 | 0.985730 | 0.988054 |
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
11: k=1 w+ n+ | 12: k=1 w+ n- | 13: k=3 w+ n+ | 14: k=3 w+ n- | 15: k=5 w+ n+ | 16: k=5 w+ n- | 17: k=7 w+ n+ | 18: k=7 w+ n- | 19: k=9 w+ n+ | 20: k=9 w+ n- | |
---|---|---|---|---|---|---|---|---|---|---|
Fold 1 | 0.979140 | 0.988587 | 0.984398 | 0.985933 | 0.983995 | 0.990601 | 0.987326 | 0.985271 | 0.987389 | 0.988169 |
Fold 2 | 0.988541 | 0.988852 | 0.988575 | 0.987664 | 0.984403 | 0.985467 | 0.984774 | 0.986314 | 0.979564 | 0.987426 |
Fold 3 | 0.985990 | 0.986343 | 0.985762 | 0.989589 | 0.984820 | 0.985038 | 0.983225 | 0.992135 | 0.984215 | 0.985262 |
Fold 4 | 0.978856 | 0.989977 | 0.982044 | 0.989305 | 0.987463 | 0.989724 | 0.982838 | 0.987977 | 0.986498 | 0.988880 |
Fold 5 | 0.985090 | 0.987503 | 0.983079 | 0.986972 | 0.982141 | 0.989050 | 0.983022 | 0.986275 | 0.981226 | 0.987906 |
Average of Folds | 0.983523 | 0.988252 | 0.984772 | 0.987893 | 0.984564 | 0.987976 | 0.984237 | 0.987594 | 0.983779 | 0.987529 |
model variations encoding:
k= : k parameter of KNN
w+ : weighted KNN
w- : non-weighted KNN
n+ : with feature normalization
n- : without feature normalization
def draw_recall_table():
print("------ Recall - for 20 model variations ------")
recall_rows = np.transpose(np.array(recall_table_columns))
recall_table = pd.DataFrame(recall_rows, columns = ['1: k=1 w- n+','2: k=1 w- n-','3: k=3 w- n+','4: k=3 w- n-','5: k=5 w- n+','6: k=5 w- n-','7: k=7 w- n+','8: k=7 w- n-','9: k=9 w- n+','10: k=9 w- n-','11: k=1 w+ n+','12: k=1 w+ n-','13: k=3 w+ n+','14: k=3 w+ n-','15: k=5 w+ n+','16: k=5 w+ n-','17: k=7 w+ n+','18: k=7 w+ n-','19: k=9 w+ n+','20: k=9 w+ n-'])
recall_table.index = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5', 'Average of Folds']
display(recall_table.iloc[:, :10].head(6))
display(recall_table.iloc[:, 10:].head(6))
print("model variations encoding: \n k= : k parameter of KNN \n w+ : weighted KNN \n w- : non-weighted KNN \n n+ : with feature normalization \n n- : without feature normalization \n")
draw_recall_table()
------ Recall - for 20 model variations ------
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
1: k=1 w- n+ | 2: k=1 w- n- | 3: k=3 w- n+ | 4: k=3 w- n- | 5: k=5 w- n+ | 6: k=5 w- n- | 7: k=7 w- n+ | 8: k=7 w- n- | 9: k=9 w- n+ | 10: k=9 w- n- | |
---|---|---|---|---|---|---|---|---|---|---|
Fold 1 | 0.964879 | 0.974979 | 0.979374 | 0.983124 | 0.982098 | 0.987231 | 0.984686 | 0.991006 | 0.985051 | 0.990108 |
Fold 2 | 0.963946 | 0.974364 | 0.975125 | 0.987568 | 0.986453 | 0.990096 | 0.985274 | 0.984587 | 0.984269 | 0.989460 |
Fold 3 | 0.956482 | 0.975630 | 0.975891 | 0.989293 | 0.980243 | 0.989838 | 0.984498 | 0.991673 | 0.984258 | 0.984340 |
Fold 4 | 0.964213 | 0.972436 | 0.978936 | 0.986719 | 0.984230 | 0.989184 | 0.986366 | 0.990025 | 0.986338 | 0.989577 |
Fold 5 | 0.964989 | 0.976846 | 0.983303 | 0.987393 | 0.976480 | 0.983704 | 0.984770 | 0.984648 | 0.988402 | 0.986043 |
Average of Folds | 0.962902 | 0.974851 | 0.978526 | 0.986819 | 0.981901 | 0.988010 | 0.985119 | 0.988388 | 0.985664 | 0.987905 |
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
11: k=1 w+ n+ | 12: k=1 w+ n- | 13: k=3 w+ n+ | 14: k=3 w+ n- | 15: k=5 w+ n+ | 16: k=5 w+ n- | 17: k=7 w+ n+ | 18: k=7 w+ n- | 19: k=9 w+ n+ | 20: k=9 w+ n- | |
---|---|---|---|---|---|---|---|---|---|---|
Fold 1 | 0.979304 | 0.988582 | 0.984669 | 0.986152 | 0.983794 | 0.990322 | 0.986863 | 0.985967 | 0.987335 | 0.987637 |
Fold 2 | 0.988731 | 0.988752 | 0.988371 | 0.987625 | 0.984495 | 0.985386 | 0.984031 | 0.985918 | 0.979358 | 0.987495 |
Fold 3 | 0.985669 | 0.985824 | 0.985575 | 0.990030 | 0.985215 | 0.985110 | 0.983652 | 0.992063 | 0.984191 | 0.985655 |
Fold 4 | 0.978392 | 0.989751 | 0.981451 | 0.989091 | 0.987412 | 0.990144 | 0.982775 | 0.988243 | 0.986532 | 0.988647 |
Fold 5 | 0.985144 | 0.987582 | 0.982972 | 0.986831 | 0.981797 | 0.988959 | 0.983493 | 0.986273 | 0.982077 | 0.988344 |
Average of Folds | 0.983448 | 0.988098 | 0.984608 | 0.987946 | 0.984543 | 0.987984 | 0.984163 | 0.987693 | 0.983899 | 0.987556 |
model variations encoding:
k= : k parameter of KNN
w+ : weighted KNN
w- : non-weighted KNN
n+ : with feature normalization
n- : without feature normalization
All 20 KNN variations performed well considering results: accuracy > 96% , precision > 96% , recall > 96% at the same time.
Yet, overall, model no 8 is the best performing, having the highest accuracy-precision-recall at the same time. Model 8 parameters are: non-weighted KNN, k=7 , no feature normalization.
From the results we get from our experiments, there isn't necessarily a clear relation between the parameter k and accuracy-precision-recall. In our experiments, we have used the values (1, 3, 5, 7, 9) for variations of k, obtained very close results (differences mostly less than 0.005%).
Though, it can be said that, the general trend for the accuracy-precision-recall for the k parameter is: higher values of k results higher precision rates until k=7. In our implementation for the KNN classifier, there's no computational overhead of using greater k values. Performance-wise, since euclidian distances are calculated for all test samples against whole training-set, there is no difference.
From the results we get from our experiments, it was seen that, using MinMax scaling decreased our prediction performance for this dataset. Without feature normalization we obtain slightly (around 0.005%) better accuracy-precision-recall results than with feature normalization.
As we have shown some answers for some questions range from narrower range than [-3,3] range. Without feature normalization, those questions were effecting the KNN prediction less than the questions which got more significant answers that are ranging between wider values.
Therefore, it can be said that questions with a small answer range are not as effective as questions with a wider answer range, in reality, in some sense. But it should be kept in mind that, we have only experimented with MinMaxScaler. And there can be some feature normalization algorithms which can increase our model's prediction performance.
From the results we get from our experiments, there isn't necessarily a clear relation between weighted KNN and uniform (non-weighted) KNN. There are very slightly (0.005%) better/worse comparisons, yet they are insignificant to establish a pattern or trend.
That is interesting, and the explanation comes to our minds is that, the samples are distributed in an almost homogenous (uniform) way that distances are very similar values, thus all neighbors of a test sample are roughly equal distance from it.
That said, for the lower values of k neighbors parameter, weighted KNN performs around 0.01% better than uniform KNN (for all accuracy, precision, and recall). Our heuristic explanation is that, when there are fewer neighbors, classification can get effected by noise more. Thus weighted KNN is more noise-tolerable: noisy neighbours are somewhat muted because of their farther distance.
And for the higher values of k, uniform KNN gives slightly (0.005%) better prediction results than weighted KNN.
From the results we get from our experiments, there isn't necessarily a clear relation between distance metric (euclidian, manhattan, etc.) and prediction success. We didn't include this variation in our tables to not overfill the table.
The k-fold cross validation is well-known for evaluating models' real world performance more accurately. The more folds, the bigger portions of data can be trained-tested against each other. However, that comes with a cost of computation time. The computation time and amount of folds is directly proportional in that sense.
Comparison between individual folds does not yield any meaning since dataset is random shuffled before split. Average of folds is the decision metric for comparing performance of different model variations.
As team "Epoche", our go-to parameters for this dataset: k=7, uniform distance (non-weighted) KNN ( since same classification performance, in less time to compute ), euclidean distance as distance metric, no feature scaling.
Here, we will show a few misclassified samples, and ask their neighbors why they were misclassified.
print(" \n3 misclassified samples and their neighbors will be ready in about 1 MINUTE of execution. Please wait... \n")
cv = KFold(5,shuffle=True, random_state=24)
X_train, X_test, y_train, y_test = next(cv.split(X,Y))
knnUniform = KNNClassifier(n_neighbors=7, weights='uniform_neighbors', n_classes=numberOfClasses)
knnUniform.fit(X_train, y_train)
misclassifiedNum = 1
predictions, neighbors = knnUniform.predict(X_test)
print("3 misclassified examples in model no 8: k=7, non-weighted KNN, no feature normalization:\n ")
for i in range(2000):
if predictions[i] != y_test[i]:
print("misclassified sample " + str(misclassifiedNum) +" :")
print(" Predicted label: " + str(predictions[i]) + "(" + personality_types[predictions[i]] + ")")
print(" Actual label: " + str(y_test[i]) + "(" + personality_types[y_test[i]] + ")")
print(" nearest Neighbours: ", end =" ")
for neighbour in neighbors[i]:
print(str(y_train[neighbour]) + "(" + personality_types[y_train[neighbour]] + ")", end =" ")
print("\n")
misclassifiedNum +=1
if misclassifiedNum >3:
break
3 misclassified samples and their neighbors will be ready in about 1 MINUTE of execution. Please wait...
3 misclassified examples in model no 8: k=7, non-weighted KNN, no feature normalization:
misclassified sample 1 :
Predicted label: 11(ENFP)
Actual label: 8(ESTP)
nearest Neighbours: 11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP) 11(ENFP)
misclassified sample 2 :
Predicted label: 9(ESFP)
Actual label: 2(ESFJ)
nearest Neighbours: 9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP) 9(ESFP)
misclassified sample 3 :
Predicted label: 4(ISTJ)
Actual label: 14(INTP)
nearest Neighbours: 4(ISTJ) 14(INTP) 4(ISTJ) 14(INTP) 4(ISTJ) 4(ISTJ) 14(INTP)
<img src="