cudaKDTree - A Library for Building and Querying Left-Balanced (point-)kd-Trees in CUDA

Left-balanced kd-trees can be used to store and query k-dimensional data points; their main advantage over other data structures is that they can be stored without any pointers or other admin data - which makes them very useful for large data sets, and/or for cases where you want to be able to predict exactly how much memory you are going to use.
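
To illustrate what "no pointers" means in practice: in a left-balanced (binary-heap ordered) tree, the children and parent of a node follow directly from its position in the array, so a plain array of points is all that needs to be stored. A minimal sketch (these helper names are illustrative, not part of this library's API):

// illustrative helpers for the implicit, pointer-free tree layout:
// node i's children live at 2i+1 and 2i+2, its parent at (i-1)/2
inline __host__ __device__ int leftChildOf(int nodeIdx)  { return 2*nodeIdx+1; }
inline __host__ __device__ int rightChildOf(int nodeIdx) { return 2*nodeIdx+2; }
inline __host__ __device__ int parentOf(int nodeIdx)     { return (nodeIdx-1)/2; }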

This repo contains CUDA code for two kinds of operations: building such trees, and querying them.

Building Left-balanced KD-Trees

The main builder provided by this repo is one for trees that are left-balanced, and where the split dimension in each level of the tree is chosen in a round-robin manner; i.e., for a builder over float3 data, the root would split in the x coordinate, the next level in y, then z, then the fourth level is back to x, etc. I also have a builder where the split dimension gets chosen based on the widest extent of the given subtree, but that one isn't included yet - let me know if you need it.
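
For illustration only (this is not the library's actual code), the depth of a node in that implicit layout, and thus its round-robin split dimension, can be computed directly from the node's array index:

// sketch: depth of node i in the implicit layout is floor(log2(i+1)),
// and the round-robin split dimension at that depth is depth % numDims
inline __device__ int levelOf(int nodeIdx)
{ return 31 - __clz(nodeIdx + 1); }

inline __device__ int roundRobinSplitDim(int nodeIdx, int numDims /* 3 for float3 */)
{ return levelOf(nodeIdx) % numDims; }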

The builder is templated over the type of data points; to use it, for example, on float3 data, use the following

cukd::buildTree<float3,float,3>(points,numPoints);

To do the same on float4 data, use

cukd::buildTree<float4,float,4>(points,numPoints);

More interestingly, if you want to use a data type where you have three float coordinates per point, and one extra 32-bit value as "payload", you can, for example, use a float4 for that point type, store each point's payload value in its float4::w member, and then build as follows:

cukd::buildTree<float4,float,3>(points,numPoints);

In this case, the builder knows that the structs provided by the user are float4s, but that the actual point coordinates are only the first three floats.
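
As a sketch of that payload idea, one could bit-cast a 32-bit integer ID into the w component (the helper names and the choice of an int payload are just examples; __int_as_float/__float_as_int are standard CUDA intrinsics):

// store a 32-bit payload (e.g., the point's original ID) bit-cast into w;
// buildTree<float4,float,3> only looks at x,y,z, so w travels along untouched
inline __device__ void setPayload(float4 &point, int payload)
{ point.w = __int_as_float(payload); }

inline __device__ int getPayload(const float4 &point)
{ return __float_as_int(point.w); }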

The builder included in this repo makes use of thrust for sorting; it runs entirely on the GPU, with complexity O(N log^2 N) (and parallel complexity O(N/K log^2 N), where K is the number of processors/cores); it also needs only one int per data point of temporary storage during the build (plus however much thrust::sort is using, which is out of my control).
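
Putting the above together, a minimal host-side build could look like the following (the include path is an assumption - adjust it to wherever the builder header lives in this repo; error checking is omitted):

#include "cukd/builder.h" // assumed include path

void buildExample(int numPoints)
{
  float3 *points = nullptr;
  // managed memory is just one convenient way to make the points
  // accessible to the GPU-side builder
  cudaMallocManaged((void **)&points, numPoints * sizeof(float3));
  // ... fill points[0..numPoints-1] with your data ...

  // builds the left-balanced kd-tree over points[], as described above
  cukd::buildTree<float3,float,3>(points, numPoints);
  cudaDeviceSynchronize();
}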

Stack-Free Traversal and Querying

This repo also contains stack-free traversal code for doing "shrinking-radius range-queries" (i.e., radius range queries where the radius can shrink during traversal). This traversal code is used in two examples: fcp (for find-closest-point) and knn (for k-nearest neighbors).

An fcp query, for example (assuming that points[] and numPoints describe a balanced kd-tree that was built as described above), can be done as follows:

__global__ void myKernel(float3 *points, int numPoints, ... ) {
   ...
   float3 queryPoint = ...;
   int idOfClosestPoint = fcp(queryPoint,points,numPoints);
   ...
}
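
A complete query kernel plus its host-side launch could look as follows (everything except fcp(), points, and numPoints is a hypothetical name; this also assumes fcp lives in the cukd namespace like the other entry points):

// hypothetical example: one fcp query per thread, writing the index of
// the closest tree point for each query into results[]
__global__ void fcpQueryKernel(const float3 *queries, int numQueries,
                               int *results,
                               float3 *points, int numPoints)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= numQueries) return;
  results[tid] = cukd::fcp(queries[tid], points, numPoints);
}

void runFcpQueries(const float3 *d_queries, int numQueries, int *d_results,
                   float3 *d_points, int numPoints)
{
  int blockSize = 128;
  int numBlocks = (numQueries + blockSize - 1) / blockSize;
  fcpQueryKernel<<<numBlocks, blockSize>>>(d_queries, numQueries, d_results,
                                           d_points, numPoints);
}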

Similarly, a knn query for k=4 elements can be done via

 cukd::FixedCandidateList<4> closest(maxRadius);
 float sqrDistOfFurthestOneInClosest
    = cukd::knn(closest,queryPoint,points,numPoints);

... or, for a larger number such as k=50, via

 cukd::HeapCandidateList<50> closest(maxRadius);
 float sqrDistOfFurthestOneInClosest
    = cukd::knn(closest,queryPoint,points,numPoints);

As shown in the last two examples, the knn code can be templated over a "container" used for storing the k nearest points. One such container provided by this library is the FixedCandidateList, which implements a linear, sorted priority queue whose insertion cost is linear in the number of elements stored in the list, but whose elements should all end up in registers, without having to go to gmem; use this for small k. Alternatively, the HeapCandidateList organizes the closest k in a heap that has O(log k) insertion cost (vs O(k) for the fixed one), but which gets accessed with register-indirect addressing and will thus generate some gmem traffic; use this for larger k, where the savings from the O(log k) insertion will actually pay off. Also note that in the fixed list all k elements will always be stored in ascending order, while in the heap list this is not the case.
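
To make the "templated over a container" point concrete, a query kernel can itself be templated over the candidate-list type and instantiated with either container (everything except cukd::knn, cukd::FixedCandidateList, and cukd::HeapCandidateList is a hypothetical name):

// hypothetical kernel that works with either candidate-list container
template<typename CandidateList>
__global__ void knnQueryKernel(const float3 *queries, int numQueries,
                               float *sqrDistOfFurthest,
                               float3 *points, int numPoints,
                               float maxRadius)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= numQueries) return;

  CandidateList closest(maxRadius);
  // returns the squared distance of the furthest of the k points found
  sqrDistOfFurthest[tid] = cukd::knn(closest, queries[tid], points, numPoints);
}

// small k - linear, register-resident list:
//   knnQueryKernel<cukd::FixedCandidateList<4>><<<numBlocks,blockSize>>>(...);
// larger k - heap-based list with O(log k) insertion:
//   knnQueryKernel<cukd::HeapCandidateList<50>><<<numBlocks,blockSize>>>(...);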