CUDA/Tensorflow KD-Tree K-Nearest Neighbor Operator

This repository implements two different custom KNN algorithms: A simple, yet memory efficient exhaustive search with quadratic runtime but linear memory. Secondly, for multiple queries of the same point cloud, a custom KD-Tree operator in CUDA (GPU), or C++ (CPU). The KD-Tree is always generated using the CPU, but is automatically transferred to the GPU for Tensorflow operations there. The KD-Tree implementation has logarithmic runtime and will quickly

The algorithms' dimensions are currently defined through template parameters and must be known at compile-time therefore. The present version compiles the library for the dimensionalities 1, 2, 3.

Usage Examples

C++

int main()
{
    std::vector<test_T> point_cloud; //KD-Tree points
    std::vector<test_T> query_points; //Query points

    //Fill point clouds (will have a size of N*dims and M*dims)
    //...

    //Create KD-Tree
    //Note that the data needs to be present at the CPU
    auto partition_info = createKDTree<test_T, dims>(point_cloud.data(), nr_points, levels);

    //CPU
    std::vector<test_T> result_dists(nr_query_points*knn);
    std::vector<point_i_t> result_idx(nr_query_points*knn);
    KDTreeKNNSearch<test_T, test_T, dims>(partition_info, 
                    nr_query_points, reinterpret_cast<std::array<test_T, dims>*>(query_points.data()), result_dists.data(), result_idx.data(), knn, levels);

    //GPU
    PartitionInfoDevice<test_T, dims>* partition_info_d = copyPartitionToGPU<test_T, dims>(partition_info);
    const auto tmp = copyData<test_T, dims>(result_dists, result_idx, std::vector<std::array<test_T, dims>>(reinterpret_cast<std::array<test_T, dims>*>(query_points.data()), 
                                                                reinterpret_cast<std::array<test_T, dims>*>(query_points.data()) + query_points.size() / dims));
    test_T* result_dists_d = std::get<0>(tmp);
    point_i_t* result_idx_d = std::get<1>(tmp);
    test_T* query_points_d  = std::get<2>(tmp);

    KDTreeKNNGPUSearch<test_T, test_T, dims>(partition_info_d, nr_query_points, 
        reinterpret_cast<std::array<test_T, dims>*>(query_points_d), result_dists_d, result_idx_d, knn, levels);

    auto result_gpu = copyDataBackToHost(result_dists_d, result_idx_d, nr_query_points, knn);
    const auto result_dists_gpu = std::get<0>(result_gpu);
    const auto result_idx_gpu = std::get<1>(result_gpu);
}

Tensorflow

Graph mode:

import numpy as np
import tensorflow as tf
from tf_nearest_neighbor import nn_distance, buildKDTree, searchKDTree

nr_refs = 1000 #Points in the KD-tree
nr_query = 100 #Points to be queried from the tree
d = 3 # Dimensions

#Create some random point clouds
points_ref = np.random.uniform(size=(nr_refs, d)).astype(np.float32) * 1e3
points_query = np.random.uniform(size=(nr_query, d)).astype(np.float32) * 1e3

#Search for the 5 nearest neighbors of each point in points_query
k = 5

points_ref_tf = tf.compat.v1.placeholder(dtype=tf.float32, shape=points_ref.shape)
points_query_tf = tf.compat.v1.placeholder(dtype=tf.float32, shape=points_query.shape)

#Build the KD-Tree in the GPU memory
build_kdtree_op = buildKDTree(points_ref_tf, levels=None) #Use maximum available levels

#Some metainformation. The important variable is part_nr which identifies the tree
structured_points, part_nr, shuffled_inds = self.sess.run(build_kdtree_op, feed_dict={points_ref_tf: points_ref})

#Search for the 5 nearest neighbors of each point in points_query
kdtree_results = searchKDTree(points_query_tf, part_nr[0], nr_nns_searches=k, shuffled_inds=shuffled_inds.astype(np.int32))
dists_knn, inds_knn = self.sess.run(kdtree_results, feed_dict={points_query_tf: points_query})

or even shorter in eager mode:

import numpy as np
import tensorflow as tf
from tf_nearest_neighbor import nn_distance, buildKDTree, searchKDTree

nr_refs = 1000 #Points in the KD-tree
nr_query = 100 #Points to be queried from the tree
d = 3 # Dimensions

#Create some random point clouds
points_ref = np.random.uniform(size=(nr_refs, d)).astype(np.float32) * 1e3
points_query = np.random.uniform(size=(nr_query, d)).astype(np.float32) * 1e3

#Build the KD-Tree in the GPU memory
structured_points, part_nr, shuffled_inds = buildKDTree(points_ref, levels=None) #Use maximum available levels

#Query the KD-Tree, for the 5 nearest neighbors of each point in points_query
dists_knn, inds_knn = searchKDTree(points_query, part_nr[0], nr_nns_searches=5, shuffled_inds=shuffled_inds)

Installation

Prerequisites

Tested on Ubuntu 18.04. Exact minimal version numbers are unknown, but here are the ones with which the library was tested.

Build:

CMake 3.13.4
g++ 7.5.0
Cuda 10.1

Python Interface:

Tensorflow >= 2.0
Numpy 1.19.2
python 3.6

For the benchmark and tests:

IPython 7.16.1
Matplotlib 3.3.2
scipy 1.5.2

Note that if you intend to install this library with an existing Tensorflow installation, you need to make sure that the CUDA and CUDNN versions match.

Commands

Inside this directory run

mkdir src/build
cd src/build
cmake .. -DCMAKE_BUILD_TYPE=RELEASE
make all

CMake will fetch the library paths from the currently active python environment. This should build the library and place the library in the main directory, where it will be loaded upon importing this library.

Docker

A Dockerfile is provided that downloads the required libraries and should allow for easy testing. Docker container commands:

docker build . -t tf_nndistance_env
docker run -it --gpus all tf_nndistance_env /bin/bash

Inside docker:

conda activate tf_nndistance_env
cd /tf_nearest_neighbor/build
./test_kdtree && ipython ../../scripts/test_knn_unit.py

Tests

The library contains a small C++ example that will perform a small simple test. It is run by calling ./test_kdtree in the build directory. For

Benchmark

We compared both implementations to the scipy.spatial.cKDTree. Note that the benchmarks do not consider the time to build the KD-Trees, or the transfer to the GPU. Times greater than 1 second not shown.

Test Machine Specs: Intel Core i7-5820K CPU 6x @3.30GHz, 32GB of working memory and a NVidia RTX 2080 GPU.

To run the benchmark on your computer, go to the folder scripts and execute ipython benchmark.py. This will create benchmark_results.npz that can be converted to a figure using ipython plot_benchmark.py.

Acknowledgements

If this works helps you in your research, please consider acknowledging the github repository, or citing our paper from which the library originated.

@article{grandits_geasi_2021,
	title = {{GEASI}: {Geodesic}-based {Earliest} {Activation} {Sites} {Identification} in cardiac models},
	volume = {37},
	issn = {2040-7947},
	shorttitle = {{GEASI}},
	url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/cnm.3505},
	doi = {10.1002/cnm.3505},
	language = {en},
	number = {8},
	urldate = {2021-08-12},
	journal = {International Journal for Numerical Methods in Biomedical Engineering},
	author = {Grandits, Thomas and Effland, Alexander and Pock, Thomas and Krause, Rolf and Plank, Gernot and Pezzuto, Simone},
	year = {2021}
}

thomgrand/tf_kdtree