An Exact K-Nearest Neighbours Optimisation Study Project Using Parallel Programming on CUDA
Cuda
Enhancing k-Nearest Neighbours Efficiency with 32-Ary Heaps on GPU
The k-Nearest Neighbours (kNN) algorithm is a fundamental tool in supervised learning, relying on proximity to classify or predict data points. In this study, I address the challenges of kNN, particularly the computational complexity associated with distance calculations and comparisons. I introduce a heap structure to efficiently maintain the $k$ nearest neighbours for each data point. To be specific, I employ a max heap structure, ensuring that only the $k$ nearest neighbours are retained in the tree for every data point. To further enhance efficiency, I implement a 32-ary heap, leveraging the parallel capabilities of GPUs. While a binary heap is commonly used for its efficiency in a sequential environment, my CUDA implementation benefits from warp-wide shuffle operations, allowing it to find the maximum child node in $O(\log_2(w))$ time, where $w$ is the warp size. The final data structure is a 32-ary heap, tailored for NVIDIA GPUs with a warp size of 32. My work contributes to the optimisation of the kNN algorithm, providing a comprehensive exploration of heap structures on a GPU. I emphasise exact kNN computation, offering insights into the advantages of parallelisation and heap structures in the context of large-scale data processing.
The k-Nearest Neighbours (kNN) Problem and Objectives
kNN is a classic algorithm in supervised learning, using proximity to classify or predict the grouping of a data point. The algorithm's intuition is relatively simple: to find $k$ nearest data points according to the queried data point (where $k$ is a model hyperparameter). From here, we could easily point out a critical challenge for this approach, which is the complexity of calculating distances, storing, and comparing them. For the sake of study, I work on this project with some objectives in mind:
Optimising the time complexity as much as I can, with access to a sufficiently large memory available (4GB RAM) and
Structuring the project in an object-oriented manner, embracing modularisation for usability.
Some approaches utilise approximation methods to trade off a little bit of accuracy with time efficiency [1-3]. To be clear, my project is not about those. In this study, I mainly focus on the optimisation problem by addressing the exact kNN algorithm.
Heap Structure on the kNN Problem
A heap is a tree-based data structure that satisfies the heap property: For any given node $C$, if $P$ is a parent node of $C$ then the value of $P$ is greater than or equal to the value of $C$ (also known as a max heap). The idea here is to only keep at most $k$ nodes in the tree at any time for each data point, i.e. the $k$ nearest neighbours to that data point. Therefore, I employed a max heap structure on each data point's nearest neighbours array. When a new pair of data points $(i, j)$ is considered, if the calculated distance is larger than the root node value of $i$'s heap, we could ignore it. Otherwise, we remove the root node with $j$ and the distance, then perform the down heap procedure. The same process is applied for $j$'s heap.
32-Ary Heap
In common, people utilise the binary heap structure as it has the most efficient time complexity. Considering two operations performed the most in a heap structure, which are moving a node up or down the heap and finding the maximum child node of a node, if we use an "unary" structure, it basically is an array structure and the moving operations takes $O(N)$ time, where $N$ is the size of the heap. If we increase the number of children available for a node, say a "ternary" or "quaternary" structure, the height of the tree might be decreased to $\log_3(N)$ or $\log_4(N)$ but the time complexity for finding the maximum child node is increased. Generally, a $d$-ary heap will take $O(\log_d(N))$ time to move all the way up and down the tree and $O(d)$ time to find the maximum child node. Combining those gives us the worst case time complexity of $O(d\log_d(N))$. Solving that brings us the best solution of $d$ is $2$.
However, this is only true in a sequential environment. Thanks to warp-wide shuffle operations, we are able to find the maximum child node in only $O(\log_2(w))$ time, where $w$ is the warp size. This is achieved by halving the problem in a parallel manner, leading us to the total complexity on $w$-ary heaps in a parallel setup of $O(\log_2(w)\log_w(N))$. The warp size is dependent on the GPU architecture. Here, I used an NVIDIA GPU, which has the warp size of 32, making the final data structure a 32-ary heap.
References
Alexandr Andoni and Piotr Indyk. 2006. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06). IEEE Computer Society, USA, 459–468. https://doi.org/10.1109/FOCS.2006.49
Alexandr Andoni and Ilya Razenshteyn. 2015. Optimal Data-Dependent Hashing for Approximate Near Neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing (STOC '15). Association for Computing Machinery, New York, NY, USA, 793–801. https://doi.org/10.1145/2746539.2746553