The final version is inside the "Completed_version" folder
Here I import a dataset, preprocess the data, and then search for the best combination of features and number of KNN neighbors.
The goal of this code is to use cross-validation to find the most important features and the best parameter of a model.
The first step is preparing the data, i.e., dealing with missing values, sorting the features by their relative standard deviation, and scaling the data. The next step is described by the following pseudocode:
Let F be the total number of selected features, K the total number of folds, and N the total number of neighbors for the KNN implementation.
- Prepare the data
  - a. Read the data and variables
  - b. Shuffle rows
  - c. Split the data into features and target datasets
  - d. Sort the features by importance, i.e. relative standard deviation
  - e. Eliminate features based on their correlations, i.e. unsupervised feature selection
  - f. Scale features
- For f from 1 to F
  - a. For n from 1 to N
    - i. Split the data pseudo-randomly into K folds
    - ii. For k from 1 to K
      - Set the k-th fold as the test dataset, and merge all other folds into the training dataset
      - Run KNN with n neighbors
      - Add the KNN accuracy to entry (f, n) of a 2D matrix
- Calculate the average scores by dividing the matrix by K
- Find the maximum score and return its indices as the best number of features and neighbors
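The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the code in this repository: the function names `prepare` and `grid_search_knn`, the 0.9 correlation threshold, sorting by descending relative standard deviation, and standardization as the scaling method are all assumptions. It also uses scikit-learn's `StratifiedKFold` in place of a hand-written fold splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier


def prepare(X, y, corr_threshold=0.9, seed=0):
    """Shuffle rows, sort features by relative std, drop correlated ones, scale."""
    order = np.random.default_rng(seed).permutation(len(y))  # b. shuffle rows
    X, y = X[order], y[order]
    # d. relative standard deviation (assumes no feature has zero mean)
    rsd = X.std(axis=0) / np.abs(X.mean(axis=0))
    X = X[:, np.argsort(rsd)[::-1]]  # assumption: higher RSD = more important
    # e. keep a feature only if it is not highly correlated with a kept one
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < corr_threshold for i in keep):
            keep.append(j)
    X = X[:, keep]
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # f. scale (standardization)
    return X, y


def grid_search_knn(X, y, max_features, max_neighbors, n_folds=5, seed=0):
    """Return (best_f, best_n, scores); scores[f-1, n-1] is the mean accuracy."""
    scores = np.zeros((max_features, max_neighbors))
    for f in range(1, max_features + 1):
        X_f = X[:, :f]  # first f features, already sorted by importance
        for n in range(1, max_neighbors + 1):
            folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
            for train_idx, test_idx in folds.split(X_f, y):
                knn = KNeighborsClassifier(n_neighbors=n)
                knn.fit(X_f[train_idx], y[train_idx])
                # accumulate the fold accuracy in entry (f, n)
                scores[f - 1, n - 1] += knn.score(X_f[test_idx], y[test_idx])
    scores /= n_folds  # average over the K folds
    best_f, best_n = np.unravel_index(np.argmax(scores), scores.shape)
    return int(best_f) + 1, int(best_n) + 1, scores


# Synthetic example data (made up for illustration only)
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(loc=10.0, size=(60, 4))
X[:, 0] += 3 * y  # one informative feature
X = np.column_stack([X, X[:, 1] + 0.01 * rng.normal(size=60)])  # near-duplicate

X_prep, y_prep = prepare(X, y)
best_f, best_n, scores = grid_search_knn(
    X_prep, y_prep, max_features=X_prep.shape[1], max_neighbors=5
)
print(best_f, best_n, scores.round(2))
```

In this sketch the near-duplicate column is removed by the correlation filter, so the search runs over the remaining four features.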
The folds are created as follows:
1- Create an empty 3-D array to store the folds
2- Divide the data by class into two classes, A and B
  a. For s from 1 to the size of class A:
    i. For k from 1 to the number of folds
      - Send class A instance s to the k-th fold
  b. For s from 1 to the size of class B
    - Send class B instance s to the k-th fold
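One possible reading of the fold-creation steps is a round-robin deal of each class's instances across the folds, which keeps the class proportions similar in every fold. The sketch below assumes that reading. The function name `make_folds`, the NaN padding of unused slots, and storing the label in the last column are illustrative choices rather than details confirmed by the original code. It handles any number of classes, not only two.

```python
import numpy as np


def make_folds(X, y, n_folds):
    """Deal the rows of each class round-robin into n_folds folds.

    Returns a float array of shape (n_folds, capacity, n_features + 1).
    The last column holds the label; unused slots are filled with NaN.
    """
    y = np.asarray(y)
    data = np.column_stack([X, y]).astype(float)
    classes = np.unique(y)
    # A fold receives at most ceil(class_size / n_folds) rows from each class
    capacity = sum(int(np.ceil(np.sum(y == c) / n_folds)) for c in classes)
    folds = np.full((n_folds, capacity, data.shape[1]), np.nan)
    filled = np.zeros(n_folds, dtype=int)  # rows used so far in each fold
    for c in classes:
        for s, row in enumerate(data[y == c]):
            k = s % n_folds  # instance s of this class goes to fold s mod K
            folds[k, filled[k]] = row
            filled[k] += 1
    return folds


# Toy data: 6 instances of class 0 and 4 of class 1, dealt into 3 folds
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 4)
folds = make_folds(X, y, n_folds=3)
fold_sizes = (~np.isnan(folds[:, :, 0])).sum(axis=1)
print(folds.shape, fold_sizes)
```

Because the folds can end up with different sizes, the fixed-size 3-D array needs padding, which is why the NaN rows are dropped again when the train and test sets are assembled.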
For a selected fold, the train and test sets are built as follows:
1- Merge all the folds except the selected one
2- Drop all NaN values
3- Return the selected fold as the test set and the merged folds as the training set
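These three steps could look like the following NumPy sketch. It assumes the folds are stored in a 3-D array of shape (folds, rows, columns) padded with NaN rows, and that genuine missing values were already handled during preprocessing. The name `train_test_from_folds` is hypothetical.

```python
import numpy as np


def train_test_from_folds(folds, k):
    """Use fold k as the test set and all other folds as the training set."""
    n_cols = folds.shape[2]
    test = folds[k]
    # Merge every fold except fold k into one 2-D array
    train = np.delete(folds, k, axis=0).reshape(-1, n_cols)
    # Drop the rows that contain NaN (the padding)
    test = test[~np.isnan(test).any(axis=1)]
    train = train[~np.isnan(train).any(axis=1)]
    return train, test


# Toy folds: 3 folds, 2 slots each, columns = (feature, label)
folds = np.array(
    [
        [[1.0, 0.0], [2.0, 0.0]],
        [[3.0, 1.0], [np.nan, np.nan]],  # second slot is padding
        [[4.0, 1.0], [5.0, 0.0]],
    ]
)
train, test = train_test_from_folds(folds, k=1)
print(train)
print(test)
```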
To compare different selections of features and KNN parameters, we use the accuracy given by KNN. We use the scikit-learn KNN implementation and its score. The scores of all folds are summed, and their average is used for the comparison. All the averaged accuracies are stored in a matrix, which can be used to draw 2-D and 3-D plots.
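In symbols, the comparison can be written as follows, where $a^{(k)}_{f,n}$ is notation introduced here for the accuracy on fold $k$ when using the first $f$ features and $n$ neighbors:

$$
\bar{a}_{f,n} = \frac{1}{K} \sum_{k=1}^{K} a^{(k)}_{f,n},
\qquad
(f^{*}, n^{*}) = \arg\max_{1 \le f \le F,\; 1 \le n \le N} \bar{a}_{f,n}
$$

The matrix of $\bar{a}_{f,n}$ values is the one used for the plots, and $(f^{*}, n^{*})$ is the reported best number of features and neighbors.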