/starling

Primary LanguageC++OtherNOASSERTION

Starling

In this repository, we share the implementations and experiments of our work Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment (arXiv).

It contains the following features:

For build,

  1. Build disk graph.
  2. Build in-memory navigation graph, based on
    1. Nodes that are uniformly-sampled.
    2. Nodes that are generated by search frequency.
  3. Perform Graph Partition on given base data

For search,

With Cache Nodes With Nav Graph With Graph Partition With use_ratio Use SQ
Beam Search
Page Search

Datasets

The datasets we used in the experiments can be downloaded and the data formats are explained in NeurIPS'21 Big-ANN Benchmark.

Dataset Data type Dimensions Distance # Query Query type
BIGANN uint8 128 L2 10000 ANNS/RS
DEEP float 96 L2 10000 ANNS/RS
SSNPP uint8 256 L2 100000 RS
Text2image float 200 IP 100000 ANNS

Quick Start

To install dependencies, run

apt install build-essential libboost-all-dev make cmake g++ libaio-dev libgoogle-perftools-dev clang-format libboost-all-dev libmkl-full-dev

To run benchmarks, go to scripts directory, copy config_sample.sh to config_local.sh, modifies the datasets paths in config_dataset.sh and run

./run_benchmark.sh [debug/release] [build/build_mem/freq/gp/search] [knn/range]
Arguement Description
debug/release Debug/Release mode to run, passed to CMake
build Build index
build_mem Build memory index
freq Generate visit-frequency file
gp Graph partition given index file
search Search index
knn Find k-nearest neighbors
range Range search

Configure datasets and parameters in config_local.sh

To run experiments with multiple segments and large-scale, please visist https://github.com/PwzXxm/segment-framework.