At its core, Curator constructs a memory-efficient clustering tree that indexes all vectors and embeds multiple per-label indexes as sub-trees. These per-label indexes are not only extremely lightweight but also capture the unique vector distribution of each label, leading to high search performance and a low memory footprint. Furthermore, each per-label index can be constructed and updated independently with minimal cost, and multiple per-label indexes can be flexibly composed to handle queries with complex filter predicates.
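To make this concrete, here is a minimal, self-contained Python sketch of the idea: a single clustering tree shared by all labels, where every node records the set of labels present beneath it, so a filtered query descends only into branches that contain the query label (that subset of nodes is, in effect, the label's sub-tree index). This is an illustration of the concept under simplified assumptions (2-means splits, exhaustive leaf scans, one label set per vector), not the repository's implementation; for the real thing, see `MultiTenantIndexIVFHierarchical.cpp` under `3rd_party/faiss`.

```python
import heapq
import numpy as np

class Node:
    def __init__(self, X, ids, labels):
        self.ids = ids                                   # vector ids under this node
        self.centroid = X[ids].mean(axis=0)
        self.labels = set().union(*(labels[i] for i in ids))
        self.children = []                               # empty => leaf

def build_tree(X, labels, ids=None, leaf_size=16, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    ids = np.arange(len(X)) if ids is None else ids
    node = Node(X, ids, labels)
    if len(ids) <= leaf_size:
        return node
    # Crude 2-means split of the vectors under this node.
    centers = X[rng.choice(ids, size=2, replace=False)]
    for _ in range(5):
        d2 = ((X[ids][:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        if 0 < assign.sum() < len(ids):                  # both halves non-empty
            centers = np.stack([X[ids[assign == s]].mean(axis=0) for s in (0, 1)])
    if assign.sum() in (0, len(ids)):                    # degenerate split: stop here
        return node
    for s in (0, 1):
        node.children.append(build_tree(X, labels, ids[assign == s], leaf_size, rng))
    return node

def filtered_search(root, X, labels, q, label, k=5):
    """Best-first descent that prunes every branch not containing `label`.
    For brevity this scans all qualifying leaves; a real index stops early."""
    heap, out, tie = [(0.0, 0, root)], [], 1
    while heap:
        _, _, node = heapq.heappop(heap)
        if not node.children:                            # leaf: scan matching vectors
            out += [(float(((X[i] - q) ** 2).sum()), int(i))
                    for i in node.ids if label in labels[i]]
        for c in node.children:
            if label in c.labels:                        # per-label sub-tree test
                heapq.heappush(heap, (float(((c.centroid - q) ** 2).sum()), tie, c))
                tie += 1
    return sorted(out)[:k]

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 8)).astype(np.float32)
labels = [{int(l)} for l in rng.integers(3, size=400)]   # one label per vector
root = build_tree(X, labels)
print(filtered_search(root, X, labels, q=X[0], label=next(iter(labels[0]))))
```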
This repository is organized as follows; the two baseline strategies named below, metadata filtering and per-tenant indexing, are contrasted in a short sketch after the list:

- `3rd_party/faiss`: C++ implementation of Curator and the baselines
  - `MultiTenantIndexIVFHierarchical.cpp`: Curator
  - `MultiTenantIndexIVFFlat.cpp`: IVF with metadata filtering
  - `MultiTenantIndexIVFFlatSep.cpp`: IVF with per-tenant indexing
  - `MultiTenantIndexHNSW.cpp`: HNSW with metadata filtering
- `indexes`: Python API for the indexes
  - `ivf_hier_faiss.py`: Curator
  - `ivf_flat_mt_faiss.py`: IVF with metadata filtering
  - `ivf_flat_sepidx_faiss.py`: IVF with per-tenant indexing
  - `hnsw_mt_hnswlib.py`: HNSW with metadata filtering
  - `hnsw_sepidx_hnswlib.py`: HNSW with per-tenant indexing
- `dataset`: code for the evaluation datasets
  - `arxiv_dataset.py`: arXiv dataset
  - `yfcc100m_dataset.py`: YFCC100M dataset
- `benchmark`: code for running the benchmarks
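For orientation, the two baselines work as follows: metadata filtering keeps one shared index and discards results that do not match the query's label, while per-tenant indexing maintains a separate index per tenant. A minimal numpy sketch of the contrast (illustrative only; all names here are ours, and brute-force scans stand in for real indexes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32)).astype(np.float32)
owner = rng.integers(4, size=1000)     # one tenant label per vector
q, tenant, k = X[0], int(owner[0]), 10

def knn(q, X, k):
    """Brute-force L2 top-k over the rows of X (stand-in for a real index)."""
    return np.argsort(((X - q) ** 2).sum(axis=1))[:k]

# (1) Metadata filtering: one shared collection; over-fetch, then drop
# results that do not belong to the querying tenant.
candidates = knn(q, X, 10 * k)
filtered = [i for i in candidates if owner[i] == tenant][:k]

# (2) Per-tenant indexing: search only the tenant's own vectors.
ids = np.flatnonzero(owner == tenant)
per_tenant = ids[knn(q, X[ids], k)]
```

Filtering over-fetches when a tenant owns a small fraction of the data, while separate per-tenant indexes replicate work and memory for vectors accessible to many tenants; Curator's shared tree is designed to sidestep both costs.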
We assume that Anaconda is installed. To install the required Python packages and activate the environment, run the following commands:

```bash
conda env create -f environment.yml -n ann_bench
conda activate ann_bench
```
Next, build the customized Faiss library, which contains the C++ implementations of Curator and the baselines, and install its Python bindings:

```bash
cd 3rd_party/faiss
cmake -B build . \
    -DFAISS_ENABLE_GPU=OFF \
    -DFAISS_ENABLE_PYTHON=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DFAISS_OPT_LEVEL=avx2 \
    -DBUILD_TESTING=ON
make -C build -j32 faiss_avx2
make -C build -j32 swigfaiss_avx2
cd build/faiss/python
python setup.py install
```
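After installation, a quick smoke test of the bindings (this exercises only stock Faiss functionality, not the Curator extensions):

```python
import numpy as np
import faiss

# Build and query a trivial flat index to confirm the bindings load.
xb = np.random.rand(100, 32).astype("float32")
index = faiss.IndexFlatL2(32)
index.add(xb)
D, I = index.search(xb[:5], 4)
print(I[:, 0])  # each query vector should be its own nearest neighbor
```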
Next, prepare the evaluation datasets. Download the YFCC100M dataset:

```bash
mkdir -p data/yfcc100m
yfcc100m_base_url="https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M"
wget -P data/yfcc100m ${yfcc100m_base_url}/base.10M.u8bin
wget -P data/yfcc100m ${yfcc100m_base_url}/base.metadata.10M.spmat
```

For the arXiv dataset:

```bash
mkdir -p data/arxiv
# manually download the arXiv dataset from https://www.kaggle.com/datasets/Cornell-University/arxiv
# and put it at data/arxiv/arxiv-metadata-oai-snapshot.json
```

Then run the preprocessing scripts to generate the evaluation datasets:

```bash
python -m dataset.yfcc100m_dataset
python -m dataset.arxiv_dataset
```
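To spot-check the downloaded vectors, you can read the `.u8bin` file directly. The sketch below assumes the standard big-ann-benchmarks binary layout (a header of two little-endian uint32 fields, point count and dimension, followed by uint8 vector data); `read_u8bin` is a hypothetical helper of ours, not part of this repository:

```python
import numpy as np

def read_u8bin(path, max_rows=None):
    """Read a .u8bin file: two little-endian uint32 header values
    (num_vectors, dim), followed by num_vectors * dim uint8 entries."""
    with open(path, "rb") as f:
        n, d = (int(x) for x in np.fromfile(f, dtype="<u4", count=2))
        if max_rows is not None:
            n = min(n, max_rows)
        return np.fromfile(f, dtype=np.uint8, count=n * d).reshape(n, d)

vecs = read_u8bin("data/yfcc100m/base.10M.u8bin", max_rows=1000)
print(vecs.shape)  # e.g., (1000, 192)
```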
```bash
# Download the cuda-keyring package for updating the CUDA Linux GPG repository key
# (see https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/).
# Replace $distro and $arch with your own distribution and architecture.
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.0-1_all.deb
```

Then build the Docker image for running the benchmarks:

```bash
sudo docker build --rm -t ann-bench .
```
Please refer to the scripts in the `scripts` folder for details. For example, to evaluate Curator on the YFCC100M dataset, run the following commands:
```bash
python=$(which python)  # assuming the conda env is activated
sudo ${python} \
    run_parallel_exp.py run_curator_overall_exp \
    --dataset yfcc100m \
    --cpu-limit 0 \
    --mem_limit 20000000000 \
    --num_runs 1
```
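The headline quality metric such benchmarks typically report is recall@k against exact ground truth. For reference, a recall computation of the usual form (a hypothetical helper of ours, not the repo's benchmark code):

```python
def recall_at_k(pred_ids, true_ids, k):
    """Average fraction of the exact top-k neighbors that the index returned,
    where pred_ids/true_ids hold one id list per query."""
    hits = sum(len(set(p[:k]) & set(t[:k])) for p, t in zip(pred_ids, true_ids))
    return hits / (k * len(true_ids))

# Toy usage: perfect predictions give recall 1.0.
truth = [[0, 1, 2], [3, 4, 5]]
print(recall_at_k(truth, truth, k=3))
```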