NOTE: Attached is the 'knn.py' file with the kNN functions implemented from scratch. The 'kNN_example.ipynb' file contains an example that uses this implementation.
k-Nearest Neighbors is a very commonly used algorithm for classification. It works well when you have a large number of classes and only a few samples per class, which is why it is widely used in face recognition.
kNN in one sentence: an algorithm that classifies a sample by assigning it the label that is most common among its k closest neighbors.
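As a minimal sketch of that one-sentence idea (not necessarily how 'knn.py' is structured; the function name and toy data below are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict a label for x_query by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```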
k Parameter - Size of Neighborhood
- k represents the number of neighbors the data point is compared with. k is usually an odd number so that the majority vote cannot end in a tie.
- the bigger the k, the smoother (less sharply 'defined') the classification regions become, as the sketch after this list illustrates.
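For instance, reusing the hypothetical `knn_predict` sketch above on made-up data: a query sitting next to a single outlier takes the outlier's label with k=1, while a larger k smooths the decision out.

```python
import numpy as np

# Class-0 cluster with one class-1 outlier sitting inside it (illustrative data)
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
                    [0.15, 0.15],   # part of the class-0 cluster
                    [0.16, 0.16]])  # class-1 outlier
y_train = np.array([0, 0, 0, 0, 0, 1])

query = np.array([0.17, 0.17])
print(knn_predict(X_train, y_train, query, k=1))  # -> 1 (the outlier decides)
print(knn_predict(X_train, y_train, query, k=5))  # -> 0 (the neighborhood smooths it out)
```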
Distance is a key factor in determining which neighbors are the closest, and the chosen distance impacts the size and characteristics of the neighborhoods. The most commonly used is Euclidean distance, since it gives the shortest (straight-line) distance between 2 points.
Most Common Distances
- Euclidean: the shortest (straight-line) distance between two points; it might not be the best option when features are not normalized, since features on larger scales dominate the distance. Typically used in face recognition.
- Taxicab or Manhattan: the sum of the absolute differences of the Cartesian coordinates of 2 points. It works the same way as when a car needs to move around 'blocks' to get to its destination.
- Minkowski: a generalization of both Euclidean and Manhattan distances (p = 2 gives Euclidean, p = 1 gives Manhattan); see the sketch after this list.
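A small sketch of these three distances in plain NumPy (function names are illustrative, not from 'knn.py'):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Taxicab distance: sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=2):
    # Generalization: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(euclidean(a, b))       # 5.0
print(manhattan(a, b))       # 7.0
print(minkowski(a, b, p=1))  # 7.0 (matches Manhattan)
print(minkowski(a, b, p=2))  # 5.0 (matches Euclidean)
```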
The number of features impacts kNN significantly: the more dimensions there are, the more 'unique' each neighborhood becomes. It also affects speed, because the distance to every training sample must be computed before the k closest neighbors can be determined.
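That brute-force cost is easy to see in a sketch: a single prediction touches every one of the n training samples across all d features before it can sort out the k nearest (sizes below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10_000, 50                 # illustrative sizes
X_train = rng.normal(size=(n_samples, n_features))
x_query = rng.normal(size=n_features)

# One prediction still requires n_samples * n_features difference/square operations
distances = np.linalg.norm(X_train - x_query, axis=1)  # shape: (10000,)
k_nearest = np.argsort(distances)[:5]                   # indices of the 5 closest samples
print(distances.shape, k_nearest)
```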