/SRS

SRS - Fast Approximate Nearest Neighbor Search in High Dimensional Euclidean Space With a Tiny Index

Primary LanguageC++

SRS - Fast Approximate Nearest Neighbor Search in High Dimensional Euclidean Space With a Tiny Index

SRS-Mem is a C++ program for performing approximate nearest neighbor search in high dimension Euclidean space in the main memory. The current implementation is adapted from our VLDB'15 research paper. The main modification is to use an in-memory multidimensional index (rather than an R-tree as in the paper), as it is often the case that our index is so small and can be accommodated in the main memory. Currently, the index is a modified version of the Cover Tree due to its strong theoretical guarantees; nevertheless, any multidimensional index that supports incremental exact kNN search can be used!

Features

  • Guaranteed Success Probability

    Theoretically, SRS guarantees to return a c-approximate nearest neighbor to the query with a user specified probability even in the worst case. For example, many heuristic methods will not return a near neighbor on some hard datasets (e.g., those generated by gen_hard_data).

    There are several other unique theoretical properties of the SRS algorithm. The top-k version of SRS guarantees to return c-k-approximate nearest neighbors with constant probability (while previous methods have no guarantee for k > 1), and the SRS-1 algorithm guarantees to return nearest neighbor (i.e., c = 1) to the query with any user specified success probability.

  • Small Index Size

    The index size of SRS is substantially smaller than the size of original data. For example, the index size for a 12GB data set (8 million 384-dimension points) is only 337MB. This means that the SRS index usually can be accommodated entirely in the main memory.

    As a side note, our index and query processing algorithm is independent of the dimensionality of the dataset, i.e., it works for arbitrarily high dimensions.

  • Rich Functionality

    The users can easily achieve a space-time balance by tuning parameters even after the index has been built. All four variants of the query processing algorithm in the paper are supported.

How it works

In a nutshell, SRS reduces the problem of "approximate NN search in high dimensional space" to a "exact T-NN search in a low dimensional space".

In the indexing phase, SRS projects data points from the original high-dimensional space into an appropriately chosen m-dimensional space via 2-stable random projections, and then uses a cover-tree to index these projected points.

The key observation is that the inter-point distance in the projected space (called Projected Distance) over that in the original high-dimensional space follows a scaled chi-squared distribution, which has a sharp concentration bound. Therefore, given any threshold on the projected distance, and for any point o, we can compute exactly the probability that o's projected distance is within the threshold.

In the querying phase, SRS performs an incrementally k-NN search on the projected space, until when it has found a satisfactory point (i.e., early-termination condition), or it has examined t * n points (i.e., normal-termination condition).

Before start

  • The users need to have Boost C++ library installed (http://www.boost.org/). The Boost library is used to calculate the quantile of chi-squared distribution.
  • Currently the program has only been tested on Linux.
  • There are four key parameters to the SRS algorithms:
    • n: number of data points.
    • c: approximation ratio.
    • m: number of 2-stable random projections to be used in the index.
    • prob_thres: the probability that the algorithm returns a c-approximate NN. Typically, n is fixed, and one can fine tune the other three parameters to achieve different space/time/quality tradoffs. Fixing any of the two parameters will place a constraint on the third parameter.
  • In addition to these input parameters, the program also generate an internal parameter t: the fraction of data points to be examined before the search terminates in the normal condition.

How to use SRS

  1. Compile the program:
% make all
  1. Use cal_param to calculate a feasible setting of parameters based on the given constraints. The users can manually set either m or the success probability. A feasible setting of parameters will be printed on the screen and it can be used later in the query processing phase. The implementation is based on Algorithm 6 in the paper. For example (using the toy dataset with 3000 points):
% ./cal_param -n 3000 -m 7 -c 4

The following message will be printed out:

A feasible setting is:
m = 7
prob_thres(-r) = 0.299203
T_max(-t) = 2
t = 0.000544
The output indicates that, in the query processing phase, the users shall
use the arguments `-c 4`, `-m 7`, `-r 0.299203` and `-t 2`.

As a rule of thumb, we recommend setting m between 6 and 10, to begin with.

  1. Use gen_gt to generate the ground truth of given dataset and query workload. The ground truth file will be used when processing the query workload. For example (using the toy dataset):
% ./gen_gt -d 192 -n 3000 -k 10 -q data/toy.q -s data/toy.ds -g data/toy.gt -y i

-y i indicates that each coordinates is an integer. The other option is -y f, indicating that each coordinates is a floating point number.

  1. Use srs with the -I option to index the data. Users need to specify m and index path in this step. For example (using the toy dataset):
% mkdir index
% ./srs -I -d 192 -i index/ -m 7 -n 3000 -s data/toy.ds -y i
  1. Use srs with the -Q option to process the query workload. The implementation is based on Algorithm 1 in the paper and its variants. The top-k approximate nearest neighbors for each query in the query workload will be returned, together with the average ratio and time over all queries.

    Users can use the parameter setting given by cal_param. For example (using the toy dataset):

% ./srs -Q -c 4 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 2 -r 0.299203
Alternatively, users can also specify the parameters by themselves to achieve another
space-time trade-off.
  1. The users can change the -t parameter to n (i.e., the cardinality of the dataset) to force the algorithm rely on the early-termination condition to stop. This will increase the quality and slightly increase the time cost. This is the SRS-2 algorithm in the paper.

  2. The users can change the -r parameter to any number larger than 1, to force the algorithm stop only on the normal-termination condition (i.e., examining tn data points). This will substantially increase the quality and time cost. This is the SRS-1 algorithm in the paper.

  3. The users can change the -c parameter to a smaller value to achieve better quality without affecting the worst case time cost. This is the SRS-12+ algorithm in the paper.

Data Format

  1. Data file should contain n lines, where n is the cardinality of the dataset. The file should be formatted as:
e_1_1 e_1_2 ... e_1_d
...
e_n_1 e_n_2 ... e_n_d

where e_i_js are integers, and are separated by whitespace.

  1. Query file should contain N+1 lines, where N is the number of queries in the query workload. The file should be formatted as:
N d
ID_1 e_1_1 e_1_2 ... e_1_d
...
ID_N e_N_1 e_N_2 ... e_N_d

where d is the dimensionality, e_i_j is an integer, and separated by whitespace.

Hard Dataset

  • Users can use gen_hard_data to generate a hard dataset with a user specified cardinality, dimensionality and approximation ratio. The dataset contains one point which is the nearest neighbor of query. All the other points are (c+ε)-NN of query (c is user specified approximation ratio), and these points are distributed randomly and uniformly on a sphere centered at the query point with radius c+ε.

  • An example of using gen_hard_data:

    % ./gen_hard_data -n 1000000 -d 128 -c 4 -s hard.ds -q hard.q
    

    Then a dataset contains 1,000,000 points (i.e., hard.ds) and a query set contains 1 query (i.e., hard.q) will be generated.

  • Users can repeat indexing and query processing using different random seeds (use the -e parameter to specify different random seeds during the indexing phase) to get empirical success probability of SRS.

    Users can also run the example script run_hard_data:

    % sh run_hard_data.sh
    

Condition of use

  • SRS is distributed under the terms of the GPL License.
  • Copyright is owned by DBWang Group, University of New South Wales, Australia.

Future work

  • Support more input data formats.
  • Integrate SRS with other multidimensional indexing methods.

Contact

Please report bugs, feature requests, comments, or suggestions to Yifang Sun (yifangs AT cse.unsw.edu.au) or Wei Wang (weiw AT cse.unsw.edu.au).