Benchmarking scripts for approximate nearest neighbor search algorithms in Python. The design of this repository is strongly influenced by a great project, ann-benchmarks, which provides comprehensive and thorough benchmarks for various libraries. In contrast, we aim to provide simpler and more lightweight benchmarks with the following features:
- Just three scripts
- Support Euclidean distance only
- Support Recall@1 only
- Support libraries that can be installed via pip/conda only
- Search with a single thread
- Sweep by a single query parameter
- Parameter specialization for each dataset
- Dynamic configuration from the command line
```shell
git clone https://github.com/matsui528/annbench.git
cd annbench
pip install -r requirements.txt

# conda install faiss-cpu -y -c pytorch  # If you'd like to try faiss, run this on anaconda
# conda install faiss-gpu -y -c pytorch  # Or, if you have GPUs, install faiss-gpu

python download.py dataset=siftsmall         # Downloaded to ./dataset
python run.py dataset=siftsmall algo=annoy   # Indices are on ./interim. Results are on ./output
python plot.py                               # Plots are on ./result_img
```
- Building the indices may take several hours. Once the indices are built, the search itself takes only a few minutes.
```shell
python download.py --multirun dataset=siftsmall,sift1m
python run.py --multirun dataset=siftsmall,sift1m algo=linear,annoy,ivfpq,hnsw
# Or, if you have GPUs:
# python run.py --multirun dataset=siftsmall,sift1m algo=linear,annoy,ivfpq,hnsw,linear_gpu,ivfpq_gpu
python plot.py
```
- All config files are on `./conf`. You can edit the config files to change parameters.
- Download a target dataset by `python download.py dataset=DATASET`. Several datasets can be downloaded at once by `python download.py --multirun dataset=DATASET1,DATASET2,DATASET3`. See hydra for more detailed APIs for multirun.
- The data will be placed on `./dataset`. The logs are written on `./log`.
- Evaluate a target algorithm (`ALGO`) with a target dataset (`DATASET`) by `python run.py dataset=DATASET algo=ALGO`. You can run multiple algorithms on multiple datasets by `python run.py --multirun dataset=DATASET1,DATASET2 algo=ALGO1,ALGO2`.
- Indices (data structures for search) are stored on `./interim`. They are reused for each run with different query parameters.
- The search result will be on `./output`. The same file will be on `./log` as well. For example, with `algo=annoy` and `dataset=siftsmall`, the result file is `./output/siftsmall/annoy/result.yaml`, and it is identical to something like `./log/2020-03-11/22-30-59/0/result.yaml`.
- By default, we run each algorithm `num_trial=10` times and report the average runtime. You can change this by, e.g., `python run.py num_trial=5`.
- You can visualize the search results by `python plot.py`. This script reads `./output` and generates a figure for each dataset on `./result_img`.
- As is the case with `run.py`, the same figure is written on `./log` as well.
- In order not to print query parameters, set the flag to false: `python plot.py with_query_param=false`.
- To change the size of the image: `python plot.py width=15 height=10`
| dataset | dimension | #base | #query | #train | misc |
|---|---|---|---|---|---|
| siftsmall | 128 | 10,000 | 100 | 25,000 | A toy dataset for hello-world |
| sift1m | 128 | 1,000,000 | 10,000 | 100,000 | |
- Write a wrapper class on `./annbench/algo`. The class must inherit the `BaseANN` class. See `annoy.py` for an example.
- Update `./annbench/algo/proxy.py`.
- Add the name of the library to `requirements.txt`.
- Add a config file on `./conf/algo`.
- Make sure the algorithm runs on a single thread.
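To get a feel for the shape of such a wrapper, here is a hypothetical minimal example. The method names (`add`, `query`) are assumptions for illustration only; check `annoy.py` and the `BaseANN` class in the repository for the actual interface. A real wrapper would typically use numpy; plain Python is used here to keep the sketch self-contained.

```python
class LinearScan:
    """Illustrative algorithm wrapper: exact search by brute force.
    The method names here are assumptions, not annbench's actual interface."""

    def __init__(self):
        self.base = []

    def add(self, vecs):
        # Register the base (database) vectors
        self.base = [list(v) for v in vecs]

    def query(self, vecs, topk):
        # For each query, rank all base vectors by squared Euclidean distance
        results = []
        for q in vecs:
            dists = [(sum((a - b) ** 2 for a, b in zip(q, v)), i)
                     for i, v in enumerate(self.base)]
            dists.sort()
            results.append([i for _, i in dists[:topk]])
        return results
```

A brute-force scan like this is what the repo's `linear` baseline conceptually does; approximate algorithms trade a little recall for much faster queries.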
- Write a wrapper class that inherits `BaseDataset` on `./annbench/dataset`. A simple example is `siftsmall.py`.
- Update `./annbench/dataset/proxy.py`.
- Add a config file on `./conf/dataset`.
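As a rough illustration of what a dataset wrapper provides, here is a toy class that serves random vectors. The accessor names (`vecs_base`, `vecs_query`) are assumptions; see `siftsmall.py` for the actual `BaseDataset` interface.

```python
import random

class RandomDataset:
    """Illustrative dataset wrapper with assumed accessor names;
    a real wrapper would download and parse a benchmark dataset instead."""

    def __init__(self, dim=4, n_base=100, n_query=10, seed=0):
        rng = random.Random(seed)
        make = lambda n: [[rng.random() for _ in range(dim)] for _ in range(n)]
        self._base = make(n_base)
        self._query = make(n_query)

    def vecs_base(self):
        return self._base   # database vectors to be indexed

    def vecs_query(self):
        return self._query  # query vectors to search with
```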
- We follow the standard evaluation criterion in the field of computer vision, Recall@1. See the evaluation function for more details. These functions are from the benchmark scripts of the faiss library.
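As a rough sketch of the metric (an illustrative re-implementation; the repository's own evaluation function, adapted from faiss's benchmark scripts, is the reference and may differ in detail), Recall@1 here counts a query as correct when its true nearest neighbor is ranked first:

```python
def recall_at_1(retrieved, groundtruth):
    """Fraction of queries whose true nearest neighbor is ranked first.
    retrieved: list of id lists, one per query, best candidate first
    groundtruth: list with the true nearest-neighbor id per query
    Illustrative only; defer to the repo's evaluation function."""
    hits = sum(int(ids[0] == gt) for ids, gt in zip(retrieved, groundtruth))
    return hits / len(groundtruth)
```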
- We define a simple guideline for setting parameters. An algorithm has several index parameters and a single query parameter. For one set of index parameters, one index (data structure) is built. For this index, we run the search by sweeping the query parameter.
- For example, with `ivfpq`, let us consider the index parameters `param_index={"M": 8, "nlist": 100}`. With these parameters, one index (let us denote it `ivfpq(M=8, nlist=100)`) is created. This index is stored on disk as `M8_nlist100.bin`, where the naming scheme is defined in the function `stringify_index_param`. A query parameter is defined as `param_query={"nprobe": [1, 2, 4, 8, 16]}`. In the search step, the index is first read from disk into memory. Then we run the search five times, i.e., `for nprobe in [1, 2, 4, 8, 16]`. This creates five results (five pairs of (recall, runtime)). By connecting these results, one polyline is drawn on the final plot.
- Note that the values of the query parameter must be sorted. If you forget to sort them (e.g., `[1, 4, 2, 8, 16]`), the final graph will look weird.
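The sweep described above can be sketched like this. It is illustrative only: `search_fn` and the result format are assumptions, not the repository's actual API.

```python
import time

def sweep_query_param(search_fn, values, queries, groundtruth):
    """Search once per query-parameter value; collect (recall, runtime) pairs."""
    assert values == sorted(values), "query-parameter values must be sorted"
    results = []
    for v in values:
        t0 = time.time()
        retrieved = search_fn(queries, v)  # e.g., v plays the role of nprobe for ivfpq
        runtime = (time.time() - t0) / len(queries)  # seconds per query
        recall = sum(int(ids[0] == gt)
                     for ids, gt in zip(retrieved, groundtruth)) / len(groundtruth)
        results.append((recall, runtime))
    return results  # connecting these points gives one polyline on the plot
```

Each `(recall, runtime)` pair becomes one point on the plot; sorting the parameter values keeps the polyline monotone along the sweep.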
- Index/query parameters for each algorithm are defined in `./conf/algo/`. These parameters are used for all datasets by default. If you'd like to specialize parameters for a dataset, you can define the specialized version in `./conf/param/`.
- For example, the default parameters for `ivfpq` are defined here, where `nlist=100`. You can set `nlist=1000` for the sift1m dataset by adding a config file here.
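As an illustration only (the file name and exact schema below are assumptions; follow the existing files under `./conf/param/` for the real format), such a specialized config might look like:

```yaml
# Hypothetical dataset-specific config specializing ivfpq for sift1m
param_index:
  M: 8
  nlist: 1000
param_query:
  nprobe: [1, 2, 4, 8, 16, 32]
```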
- In addition to editing the config files, you can override values from the command line thanks to hydra.
- Examples:
  - Writing index structures on `SOMEWHERE/LARGE_HDD/`: `python run.py interim=SOMEWHERE/LARGE_HDD/interim`
  - Running the `ivfpq` algorithm with different query parameters: `python run.py algo=ivfpq dataset=siftsmall param_query.nprobe=[1,5,25]`
- Feel free to open a pull request