TSDG is an index graph for graph-based approximate nearest neighbor search, which is fast to build, efficient to search, and GPU friendly.
$ cd tsdg
$ mkdir build && cd build
$ cmake ..
$ make -j
The demos for building index and searching are under the directory examples/
Firstly, we need to prepare a SC-graph.
The SC-graph need to be built in advance, we constructed SC-graph by modified NN-Descent in nndescent.hpp
You can use our example code to build the SC-graph as follows:
$ build/examples/scg_build DATA_PATH K SAMPLE_NUM ITERATION_NUM OUTPUT_PATH
DATA_PATH
is the path of the base data infvecs
format.K
is the number of edges of the SC-Graph.SAMPLE_NUM
controls the number of samples in each NN-Descent iteration.ITERATION_NUM
controls the number of NN-Descent iterations.OUTPUT_PATH
is the path of the generated SC-graph.
An example:
$ build/examples/scg_build \
./data/sift1m/sift1m_base.fvecs \
200 16 10\
./data/sift1m/index/sift1m_scg.ivecs
The following are the parameters used for building the SC-graphs on the CPU in our paper.
Dataset | K |
SAMPLE_NUM |
ITERATION_NUM |
---|---|---|---|
SIFT1M | 200 | 16 | 10 |
DEEP1M | 200 | 16 | 10 |
T2I1M | 400 | 16 | 10 |
SPACEV1M | 200 | 20 | 12 |
GIST1M | 400 | 16 | 12 |
Turing1M | 200 | 20 | 15 |
Note: The above parameter settings consider the balance of building efficiency and search efficiency, appropriately increasing SAMPLE_NUM and ITERATION_NUM may improve the search efficiency.
The above datasets can be downloaded from http://corpus-texmex.irisa.fr/ and https://big-ann-benchmarks.com/
Secondly, we will convert the SC-graph to the TSDG index.
You can use our example code to achieve this conversion as follows:
$ build/examples/tsdg_build DATA_PATH SCG_PATH RELAXED_FACTOR OCCL_THRESHOLD MAX_EDGES OUTPUT_PATH
DATA_PATH
is the path of the base data infvecs
format.SCG_PATH
is the path of the pre-built SC-graph inivecs
format.RELAXED_FACTOR
used for first-stage graph diversification, controls the average degree of index graph.OCCL_THRESHOLD
used for second-stage graph diversification, controls the average degree of index graph.MAX_EDGES
used for limiting the degree of index graph.OUTPUT_PATH
is the path of the generated TSDG index.
An example:
$ build/examples/tsdg_build \
./data/sift1m/sift1m_base.fvecs \
./data/sift1m/sift1m_scg.ivecs \
1.2 4 60 \
./data/sift1m/index/sift1m.tsdg
The parameters RELAXED_FACTOR
in the source code corresponds to the parameter
The parameters OCCL_THRESHOLD
in the source code corresponds to the parameter
The following are the parameters used for building and searching experiments on the CPU.
Dataset | RELAXED_FACTOR |
OCCL_THRESHOLD |
MAX_EDGES |
---|---|---|---|
SIFT1M | 1.2 | 4 | 60 |
DEEP1M | 1.2 | 4 | 60 |
T2I1M | 1.2 | 4 | 60 |
SPACEV1M | 1.2 | 4 | 80 |
GIST1M | 1.1 | 8 | 80 |
Turing1M | 1.2 | 4 | 80 |
Note: The above parameter settings consider the efficiency of graph construction and the consistency of parameters. The search performance is better when the parameter RELAXED_FACTOR of T2I1M and SPACEV1M is 1.3, but it will increase the time of graph construction.
The index graph building time (s) is shown below (including the SC-graph and k-NN graph building time):
Dataset | TSDG⁺ |
NSG |
SSG |
DPG |
HNSW |
---|---|---|---|---|---|
SIFT1M | 307.5 | 492.8 | 359.8 | 462.8 | 364.0 |
DEEP1M | 282.1 | 449.5 | 320.4 | 400.9 | 326.1 |
T2I1M | 595.1 | 1472.4 | 913.9 | 1888.9 | 937.2 |
SPACEV1M | 539.9 | 657.9 | 676.8 | 659.5 | 615.4 |
GIST1M | 1363.7 | 2794.6 | 1728.0 | 6138.7 | 1595.9 |
Turing1M | 726.5 | 1267.8 | 1052.9 | 827.6 | 792.7 |
You can use our demo code to search on graph as follows:
build/examples/tsdg_search INDEX_PATH QUERY_PATH GROUND_TRUTH_PATH
An example:
$ build/examples/tsdg_search \
./data/sift1m/index/sift1m.tsdg \
./data/sift1m/sift1m_query.fvecs \
./data/sift1m/sift1m_gt.ivecs
The performance of the search on the graph is shown below:
Due to licensing issues, this library is not able to provide the code on the GPU, the GPU related code will be open source in the future.