Vincent Garcia
Eric Debreuve
Michel Barlaud
* V. Garcia and E. Debreuve and F. Nielsen and M. Barlaud.
k-nearest neighbor search: fast GPU-based implementations and application to high-dimensional feature matching.
In Proceedings of the IEEE International Conference on Image Processing (ICIP), Hong Kong, China, September 2010
* V. Garcia and E. Debreuve and M. Barlaud.
Fast k nearest neighbor search using GPU.
In Proceedings of the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 2008.
* Vincent Garcia
Ph.D. Thesis: Suivi d'objets d'intérêt dans une séquence d'images : des points saillants aux mesures statistiques
Université de Nice - Sophia Antipolis, Sophia Antipolis, France, December 2008
- The computer must have a CUDA-enabled graphic card (c.f. NVIDIA website)
- CUDA has to be installed (CUDA drivers and CUDA toolkit)
- A C compiler has to be installed
- For using the CUDA code in Matlab, please refer to the CUDA Matlab plug-in webpage for requirements.
The provided code can be used for C and Matlab applications.
We provide bellow the C and Matlab procedures to compile and execute our code.
The user must have a basic knowledge of compiling and executing standard examples before trying to compile and execute our code.
We will consider here that we want to compile the file "knn_cuda_with_indexes.cu".
For C
1. Set the global variable MATLAB_CODE to 0 in file knn_cuda_with_indexes.cu
2. Compile the CUDA file with the command line:
nvcc -o knn_cuda_with_indexes.exe knn_cuda_with_indexes.cu -lcuda -D_CRT_SECURE_NO_DEPRECATE
3. Execute the knn_cuda_with_indexes.exe with the command line:
./knn_cuda_with_indexes.exe
For MATLAB
1. Set the global variable MATLAB_CODE to 1 in file knn_cuda_with_indexes.cu
2. Compile the CUDA file with the Matlab command line:
nvmex -f nvmexopts.bat knn_cuda_with_indexes.cu -I'C:\CUDA\include' -L'C:\CUDA\lib' -lcufft -lcudart -lcuda -D_CRT_SECURE_NO_DEPRECATE
3. Execute the run_matlab.m script
In CUDA, it is usual to use the notion of array.
For our kNN search program, the following array
A = | 1 3 5 |
| 2 4 6 |
corresponds to the a set of 3 points of dimension 2:
p1 = (1, 2)
p2 = (3, 4)
p3 = (5, 6)
The array A is actually stored in memory as a linear vector:
A = (1, 3, 5, 2, 4, 6)
The organisation of data is different in Matlab and in CUDA. For Matlab, the previous linear vector
corresponds to an array of 3 lines and 2 columns:
| 1 2 |
A = | 3 4 |
| 5 6 |
CUDA
Number of reference points : 4096
Number of query points : 4096
Dimension of points : 32
Number of neighbors to consider : 20
Processing kNN search : done in 15.662722 s for 100 iterations (0.156627 s by iteration)
cuBLAS
Number of reference points : 4096
Number of query points : 4096
Dimension of points : 32
Number of neighbors to consider : 20
Processing kNN search : done in 12.456899 s for 100 iterations (0.124569 s by iteration)
Time(%) Time Calls Avg Min Max Name
78.15% 1.23364s 100 12.336ms 12.279ms 12.389ms cuInsertionSort(float*, int, int*, int, int, in
t, int)
20.09% 317.18ms 100 3.1718ms 3.1694ms 3.1745ms cuComputeDistanceTexture(int, float*, int, int,
int, float*)
0.63% 10.023ms 200 50.116us 49.761us 57.730us [CUDA memcpy DtoH]
0.55% 8.6914ms 100 86.914us 86.466us 91.651us [CUDA memcpy HtoA]
0.55% 8.6361ms 100 86.361us 85.794us 92.802us [CUDA memcpy HtoD]
0.02% 382.09us 100 3.8200us 3.6480us 4.0960us cuParallelSqrt(float*, int, int, int)
==30641== Profiling result:
Time(%) Time Calls Avg Min Max Name
88.46% 1.19670s 100 11.967ms 11.921ms 11.993ms cuInsertionSort(float*, int, int*, int, int, in
t, int)
9.49% 128.35ms 100 1.2835ms 1.2822ms 1.2846ms cuComputeDistanceGlobal(float*, int, int, float
*, int, int, int, float*)
1.28% 17.344ms 200 86.717us 86.306us 93.058us [CUDA memcpy HtoD]
0.74% 10.022ms 200 50.110us 49.761us 61.282us [CUDA memcpy DtoH]
0.03% 380.90us 100 3.8080us 3.6160us 4.1280us cuParallelSqrt(float*, int, int, int)
Time(%) Time Calls Avg Min Max Name
89.75% 1.21083s 100 12.108ms 12.054ms 12.166ms cuInsertionSort(float*, int, int*, int, int, in
t, int)
4.50% 60.706ms 100 607.06us 598.16us 612.56us cuAddRNorm(float*, int, int, int, float*)
2.64% 35.649ms 100 356.49us 342.28us 375.15us maxwell_sgemm_128x128_raggedMn_nt
1.29% 17.369ms 201 86.413us 1.2160us 92.674us [CUDA memcpy HtoD]
1.02% 13.788ms 100 137.88us 137.73us 138.05us cuAddQNormAndSqrt(float*, int, int, float*, int
)
0.74% 10.023ms 200 50.115us 49.505us 52.354us [CUDA memcpy DtoH]
0.05% 679.41us 200 3.3970us 3.1360us 3.7440us cuComputeNorm(float*, int, int, int, float*)
More detailed
NVPROF is profiling process 5417, command: ./knn_cublas_with_indexes.exe
Number of reference points : 16384
Number of query points : 16384
Dimension of points : 128
Number of neighbors to consider : 1
Processing kNN search : done in 4.767463 s for 10 iterations (0.476746 s by iteration)
==5417== Profiling application: ./knn_cublas_with_indexes.exe
==5417== Profiling result:
Time(%) Time Calls Avg Min Max Name
33.42% 103.23ms 10 10.323ms 10.291ms 10.350ms cuAddRNorm(float*, int, int, int, float*)
28.21% 87.144ms 10 8.7144ms 8.7070ms 8.7265ms maxwell_sgemm_128x128_raggedMn_nt
22.24% 68.711ms 10 6.8711ms 6.8645ms 6.8830ms cuInsertionSort(float*, int, int*, int, int, int, int)