In this repository, we share the implementations and experiments of our work Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment (arXiv).
It contains the following features:
For build,
- Build disk graph.
- Build in-memory navigation graph, based on
- Nodes that are uniformly-sampled.
- Nodes that are generated by search frequency.
- Perform Graph Partition on given base data
For search,
With Cache Nodes | With Nav Graph | With Graph Partition | With use_ratio |
Use SQ | |
---|---|---|---|---|---|
Beam Search | ✅ | ✅ | |||
Page Search | ✅ | ✅ | ✅ | ✅ | ✅ |
The datasets we used in the experiments can be downloaded and the data formats are explained in NeurIPS'21 Big-ANN Benchmark.
Dataset | Data type | Dimensions | Distance | # Query | Query type |
---|---|---|---|---|---|
BIGANN | uint8 | 128 | L2 | 10000 | ANNS/RS |
DEEP | float | 96 | L2 | 10000 | ANNS/RS |
SSNPP | uint8 | 256 | L2 | 100000 | RS |
Text2image | float | 200 | IP | 100000 | ANNS |
To install dependencies, run
apt install build-essential libboost-all-dev make cmake g++ libaio-dev libgoogle-perftools-dev clang-format libboost-all-dev libmkl-full-dev
To run benchmarks, go to scripts
directory, copy config_sample.sh
to config_local.sh
, modifies the datasets paths in config_dataset.sh
and run
./run_benchmark.sh [debug/release] [build/build_mem/freq/gp/search] [knn/range]
Arguement | Description |
---|---|
debug/release |
Debug/Release mode to run, passed to CMake |
build |
Build index |
build_mem |
Build memory index |
freq |
Generate visit-frequency file |
gp |
Graph partition given index file |
search |
Search index |
knn |
Find k-nearest neighbors |
range |
Range search |
Configure datasets and parameters in config_local.sh
To run experiments with multiple segments and large-scale, please visist https://github.com/PwzXxm/segment-framework.