Supervised-ML-Algorithms

Implemented separate classes for two supervised machine learning algorithms: the k Nearest Neighbours algorithm and the Decision Tree algorithm.

Author: Amogha A Halhalli

Roll No: 2021101007

K Nearest Neighbours

Section 2.1: Pictionary Dataset

Loaded the dataset from the data.npy file and analysed the data.
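
A minimal loading sketch; treating data.npy as an object array with one sample per row (holding embeddings and a label) is an assumption here:

    import numpy as np

    # data.npy stores Python objects, so allow_pickle is required.
    data = np.load("data.npy", allow_pickle=True)
    print(data.shape)   # overall shape of the dataset
    print(data[0])      # inspect one sample to see its fields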

Section 2.2: Exploratory Data Analysis

Implemented a graph that shows the distribution of the labels across the entire dataset using Matplotlib.

[Figure: label distribution]
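
A sketch of how such a plot can be produced with Matplotlib; the labels array below is a placeholder for the dataset's actual labels:

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder labels; in the notebook these come from the loaded dataset.
    labels = np.array(["cat", "dog", "cat", "tree", "dog", "cat"])
    values, counts = np.unique(labels, return_counts=True)

    plt.figure(figsize=(10, 4))
    plt.bar(values, counts)
    plt.xlabel("Label")
    plt.ylabel("Number of samples")
    plt.title("Label distribution across the dataset")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()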

Section 2.3: KNN Implementation

  • Created a class for KNN which takes the train data, the k value, the encoder type and the distance metric as parameters (see the sketch after this list).
  • Split the entire dataset into train data and test data with train_size=0.8.
  • Implemented set methods to modify the value of k, the encoder type and the distance metric.
  • Calculated the F1 score, accuracy, precision and recall using sklearn metrics.
  • Used average='weighted' and zero_division=0 to calculate the F1 score, precision and recall.
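
A condensed sketch of the class interface described above; the method names, the full-sort neighbour search and the majority vote are illustrative, not the exact implementation:

    import numpy as np
    from sklearn.metrics import (accuracy_score, f1_score,
                                 precision_score, recall_score)

    class KNN:
        def __init__(self, train_X, train_y, k, encoder_type, distance_metric):
            self.train_X = np.asarray(train_X)
            self.train_y = np.asarray(train_y)
            self.k = k
            self.encoder_type = encoder_type        # e.g. "ResNet" or "VIT"
            self.distance_metric = distance_metric  # e.g. "euclidean" or "manhattan"

        # Set methods to modify the hyperparameters after construction.
        def set_k(self, k): self.k = k
        def set_encoder_type(self, encoder_type): self.encoder_type = encoder_type
        def set_distance_metric(self, metric): self.distance_metric = metric

        def _distances(self, x):
            if self.distance_metric == "manhattan":
                return np.abs(self.train_X - x).sum(axis=1)
            return np.sqrt(((self.train_X - x) ** 2).sum(axis=1))

        def predict(self, X):
            preds = []
            for x in np.asarray(X):
                nearest = np.argsort(self._distances(x))[: self.k]
                vals, counts = np.unique(self.train_y[nearest], return_counts=True)
                preds.append(vals[np.argmax(counts)])  # majority vote
            return np.array(preds)

        def evaluate(self, X, y):
            preds = self.predict(X)
            return {
                "accuracy": accuracy_score(y, preds),
                "f1": f1_score(y, preds, average="weighted", zero_division=0),
                "precision": precision_score(y, preds, average="weighted", zero_division=0),
                "recall": recall_score(y, preds, average="weighted", zero_division=0),
            }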

Section 2.4: Hyperparameter Tuning

  • Found the best triplet (k, encoder type, distance metric) by iterating through all possible triplets (see the sketch below).
  • Sorted the triplets by accuracy and listed the top 20.
  • Plotted the k vs accuracy graph with VIT as the encoder type and manhattan as the distance metric.
  • Used the Matplotlib library to construct the plots.

[Figures: top 20 triplets; k vs accuracy]
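
An illustrative search loop building on the KNN sketch above; the candidate grids and the train/test variable names are assumptions:

    from itertools import product

    model = KNN(train_X, train_y, k=1, encoder_type="VIT",
                distance_metric="manhattan")

    results = []
    for k, encoder, metric in product(range(1, 21),
                                      ["ResNet", "VIT"],
                                      ["euclidean", "manhattan"]):
        model.set_k(k)
        model.set_encoder_type(encoder)  # the real class switches embeddings here
        model.set_distance_metric(metric)
        acc = model.evaluate(test_X, test_y)["accuracy"]
        results.append((acc, k, encoder, metric))

    # Sort by accuracy (descending) and keep the 20 best triplets.
    top_20 = sorted(results, reverse=True)[:20]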

Section 2.5: Testing

  • Created a bash script that takes the path of a test file as its first argument.
  • It prints the accuracy, F1 score, recall and precision on the test data in a table.
  • The train and test files are assumed to contain data in the same format as the data.npy file.
  • The path to the train file can also be given as the second argument to the bash script.
  • If the second argument is not given, the existing data.npy in the current directory is used as the train file.
  • The bash script in turn runs check.py to calculate the scores (a sketch of its argument handling follows this list). Do not move or modify the check.py file.
  • Proper error handling is done for missing files and incorrect file names.
  • The script uses k=7, VIT as the encoder type and manhattan as the distance metric.
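
A hypothetical sketch of the argument and error handling inside check.py; the real file may be structured differently:

    import os
    import sys
    import numpy as np

    # argv[1] is the test file; argv[2] is the optional train file.
    if len(sys.argv) < 2:
        sys.exit("usage: python check.py <test_file> [train_file]")

    test_path = sys.argv[1]
    train_path = sys.argv[2] if len(sys.argv) > 2 else "data.npy"

    for path in (test_path, train_path):
        if not os.path.isfile(path):
            sys.exit(f"error: file not found: {path}")

    train = np.load(train_path, allow_pickle=True)
    test = np.load(test_path, allow_pickle=True)
    # ... evaluate with k=7, the VIT encoder and the manhattan metric,
    # then print accuracy, F1 score, recall and precision as a table.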

Section 2.6: Optimization

  • The initial time complexity is O(1) for training and O(Nd + N log N) per test sample, where N is the number of training samples and d the embedding dimension.
  • Then used a heap during testing to reduce the cost of selecting the k nearest neighbours from O(N log N) to O(N log k) (see the sketch below).
  • Improved the execution time further through vectorization with numpy arrays.
  • The initial KNN model and the most optimized KNN model turned out to be the same implementation of the algorithm.
  • The best KNN model is the one with maximum accuracy; the most optimized KNN model is the one that takes the least time to run.

[Figures: inference time comparison; time vs train size]
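
A minimal sketch of the heap idea combined with numpy vectorization: the distances are computed in one vectorized pass, and a heap keeps only the k smallest instead of sorting all N of them:

    import heapq
    import numpy as np

    def k_nearest_heap(train_X, train_y, x, k):
        # Vectorized manhattan distances to every training point at once.
        dists = np.abs(train_X - x).sum(axis=1)
        # heapq keeps the k smallest in O(N log k) instead of O(N log N).
        nearest = heapq.nsmallest(k, range(len(dists)), key=dists.__getitem__)
        return train_y[np.array(nearest)]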

Decision Tree

Section 3.1: Data Exploration

  1. Data visualization and exploration
    Thoroughly went through all the attributes of the given dataset.
    Found the number of unique labels and the attributes that need to be encoded.

  2. Data preprocessing
    Used a multi-label binarizer to encode the labels.
    Used one-hot encoding to encode the categorical variables.

  3. Data featurization
    Found that the city attribute takes an almost unique value in each sample.
    Such attributes can be dropped to avoid the overhead of the many features one-hot encoding would create.

  4. Train-test splitting
    Initially split the entire data into X (features) and Y (labels).
    Then split X and Y into train data and test data with train_size=0.8 (a preprocessing sketch follows this list).
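
A preprocessing sketch under assumed column names ("labels" as a space-separated multi-label column, "city" as the dropped high-cardinality attribute):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MultiLabelBinarizer

    df = pd.read_csv("advertisement.csv")
    df = df.drop(columns=["city"])  # near-unique per sample, so dropped

    # Encode the multi-label target column with a multi-label binarizer.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(df.pop("labels").str.split())

    # One-hot encode the remaining categorical attributes.
    X = pd.get_dummies(df)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8)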

Section 3.2: Decision Tree

Loaded the dataset from the provided file advertisement.csv using pandas.

Section 3.3: MultiLabel Classification

  • Created a class for the Decision Tree which takes the criterion, max depth and max features as parameters.
  • Implemented set methods to modify the criterion, max depth and max features.
  • Used the inbuilt sklearn decision tree to build the Decision Tree classifier.
  • Implemented the Powerset formulation using the LabelPowerset class.
  • Implemented the MultiOutput formulation using the MultiOutputClassifier class (a sketch of both follows this list).
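
A sketch of both formulations around the sklearn decision tree. LabelPowerset is assumed here to come from the scikit-multilearn package, and the hyperparameter values are placeholders:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.multioutput import MultiOutputClassifier
    from skmultilearn.problem_transform import LabelPowerset

    X_tr = np.asarray(X_train, dtype=float)  # from the preprocessing sketch

    base = DecisionTreeClassifier(criterion="gini", max_depth=10)

    # Powerset formulation: each distinct label combination becomes one class.
    powerset = LabelPowerset(classifier=base, require_dense=[True, True])
    powerset.fit(X_tr, Y_train)

    # MultiOutput formulation: one decision tree is fitted per label.
    multi = MultiOutputClassifier(base)
    multi.fit(X_tr, Y_train)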

Section 3.4: Hyperparameter Tuning

  • Reported accuracy, F1 (micro and macro), precision and recall scores for every possible triplet of hyperparameters for both the Powerset and MultiOutput formulations.
  • The files powerset.txt and multioutput.txt contain the above scores for the Powerset and MultiOutput formulations respectively.
  • Implemented a pooled confusion matrix to avoid maintaining multiple per-label matrices.
  • Used Hamming loss instead of accuracy_score to measure accuracy.
  • Ranked the top 3 performing sets of hyperparameters by F1 score (macro) for both the Powerset and MultiOutput formulations.
  • Implemented K-fold validation metrics with K = 8 (a sketch follows this list).

[Figures: top 3 hyperparameter sets; K-fold results]
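
A sketch of the 8-fold loop, reusing the MultiOutput model and the X, Y arrays from the sketches above; reporting accuracy as 1 - Hamming loss is one common convention and an assumption here:

    import numpy as np
    from sklearn.metrics import f1_score, hamming_loss
    from sklearn.model_selection import KFold

    X_arr, Y_arr = np.asarray(X, dtype=float), np.asarray(Y)
    kf = KFold(n_splits=8, shuffle=True, random_state=0)

    accs, f1s = [], []
    for train_idx, val_idx in kf.split(X_arr):
        multi.fit(X_arr[train_idx], Y_arr[train_idx])
        preds = multi.predict(X_arr[val_idx])
        accs.append(1 - hamming_loss(Y_arr[val_idx], preds))
        f1s.append(f1_score(Y_arr[val_idx], preds, average="macro", zero_division=0))

    print("mean accuracy:", np.mean(accs))
    print("mean macro F1:", np.mean(f1s))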