A lightweight, fast, flexible and easy-to-use benchmarking tool specifically designed for the performance evaluation and optimization of pgvector.
pgvectorbench
consists of five phases, each of which can be run independently or chained together to achieve a comprehensive benchmarking process:
- Setup: This phase involves setting up the benchmarking table and potentially creating indexes before loading data into the table. Additionally, any necessary extensions can be created during this phase.
- Load: In this phase, the dataset is loaded into the benchmarking table. Efficient data loading mechanisms are implemented to ensure that the dataset is ingested quickly and reliably, ready for subsequent phases.
- Index: After the data has been loaded, this phase is dedicated to the construction of indexes. It is designed to potentially yield more optimized index build times.
- Query: Benchmarking queries are executed in this phase, and metrics such as queries per second (QPS), latency, and recall are calculated. Latency and recall are determined using user-specified percentages.
- Teardown: This final phase involves performing any necessary cleanup tasks after the benchmarking is complete. This may include dropping indexes, truncating or dropping tables, and removing any extensions that were created during setup.
Real world dataset matters, pgvectorbench support two kinds of datasets for now:
- VECS
- The vectors are stored in raw little endian. Each vector takes 4+d*4 bytes for .fvecs and .ivecs formats, and 4+d bytes for .bvecs formats, where d is the dimensionality of the vector.
- Parquet
- Curated by Zilliz, is uniformly structured in the efficient Parquet file format. Use
aws s3 ls s3://assets.zilliz.com/benchmark/ --region us-west-2 --no-sign-request
to list all datasets. - In specific use cases, complex query formulations can be designed to include supplementary filter conditions on non-vector attributes, thereby refining the search criteria.
- Curated by Zilliz, is uniformly structured in the efficient Parquet file format. Use
Datasets details:
Dataset | Format | Metric | dim | nb base vectors | nb query vectors | Download |
---|---|---|---|---|---|---|
siftsmall | VECS | L2 | 128 | 10,000 | 100 | wget ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz |
sift | VECS | L2 | 128 | 1,000,000 | 10,000 | wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz |
gist | VECS | L2 | 960 | 1,000,000 | 1,000 | wget ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz |
glove | VECS | L2 | 100 | 1,183,514 | 10,000 | wget http://downloads.zjulearning.org.cn/data/glove-100.tar.gz |
crawl | VECS | L2 | 300 | 1,989,995 | 10,000 | wget http://downloads.zjulearning.org.cn/data/crawl.tar.gz |
deep1B | VECS | L2 | 96 | 1,000,000,000 | 10,000 | https://yadi.sk/d/11eDCm7Dsn9GA chunks must be concatenated into one file(deep1B_base.fvecs) before loading |
cohere_small_100k | Parquet | COSINE | 768 | 100,000 | 1,000 | aws s3 cp s3://assets.zilliz.com/benchmark/cohere_small_100k/ ./cohere_small_100k/ --region us-west-2 --recursive --no-sign-request |
cohere_small_100k_filter1 | Parquet | COSINE | 768 | 100,000 | 1,000 | same as 👆🏻 |
cohere_small_100k_filter99 | Parquet | COSINE | 768 | 100,000 | 1,000 | same as 👆🏻 |
cohere_medium_1m | Parquet | COSINE | 768 | 1,000,000 | 1,000 | aws s3 cp s3://assets.zilliz.com/benchmark/cohere_medium_1m/ ./cohere_medium_1m/ --region us-west-2 --recursive --no-sign-request |
cohere_medium_1m_filter1 | Parquet | COSINE | 768 | 1,000,000 | 1,000 | same as 👆🏻 |
cohere_medium_1m_filter99 | Parquet | COSINE | 768 | 1,000,000 | 1,000 | same as 👆🏻 |
cohere_large_10m | Parquet | COSINE | 768 | 10,000,000 | 1,000 | aws s3 cp s3://assets.zilliz.com/benchmark/cohere_large_10m/ ./cohere_large_10m/ --region us-west-2 --recursive |
cohere_large_10m_filter1 | Parquet | COSINE | 768 | 10,000,000 | 1,000 | same as 👆🏻 |
cohere_large_10m_filter99 | Parquet | COSINE | 768 | 10,000,000 | 1,000 | same as 👆🏻 |
openai_small_50k | Parquet | COSINE | 1536 | 50,000 | 1,000 | aws s3 cp s3://assets.zilliz.com/benchmark/openai_small_50k/ ./openai_small_50k/ --region us-west-2 --recursive --no-sign-request |
openai_small_50k_filter1 | Parquet | COSINE | 1536 | 50,000 | 1,000 | same as 👆🏻 |
openai_small_50k_filter99 | Parquet | COSINE | 1536 | 50,000 | 1,000 | same as 👆🏻 |
openai_medium_500k | Parquet | COSINE | 1536 | 500,000 | 1,000 | aws s3 cp s3://assets.zilliz.com/benchmark/openai_medium_500k/ ./openai_medium_500k/ --region us-west-2 --recursive --no-sign-request |
openai_medium_500k_filter1 | Parquet | COSINE | 1536 | 500,000 | 1,000 | same as 👆🏻 |
openai_medium_500k_filter99 | Parquet | COSINE | 1536 | 500,000 | 1,000 | same as 👆🏻 |
openai_large_5m | Parquet | COSINE | 1536 | 5,000,000 | 1,000 | aws s3 cp s3://assets.zilliz.com/benchmark/openai_large_5m/ ./openai_large_5m/ --region us-west-2 --recursive --no-sign-request |
openai_large_5m_filter1 | Parquet | COSINE | 1536 | 5,000,000 | 1,000 | same as 👆🏻 |
openai_large_5m_filter99 | Parquet | COSINE | 1536 | 5,000,000 | 1,000 | same as 👆🏻 |
laion_large_100m | Parquet | L2 | 768 | 100,000,000 | 1,000 | aws s3 cp s3://assets.zilliz.com/benchmark/laion_large_100m/ ./laion_large_100m/ --region us-west-2 --recursive --no-sign-request |
MacOS
brew install apache-arrow
brew install libpq
Debian
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y libparquet-dev libpq-dev
git submodule update --init --recursive
mkdir build && cd build
cmake .. && make -j
docker build -t pgvectorbench .
./pgvectorbench --help
Usage: pgvectorbench [--help] [--version] [--host VAR] [--port VAR] [--username VAR] [--password VAR] [--dbname VAR] [--dataset VAR] [--path VAR] [--log VAR] [--setup VAR] [--load VAR] [--index VAR] [--query VAR] [--teardown VAR]
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
-h, --host database server host or socket directory
-p, --port database server port
-U, --username database user name
-W, --password password for the specified user
-d, --dbname database name to connect to
-D, --dataset dataset name used to run the benchmark [nargs=0..1] [default: "siftsmall"]
-P, --path dataset path
-l, --log send log to file
--setup k/v pairs seperated by semicolon for setup options [nargs=0..1] [default: ""]
--load k/v pairs seperated by semicolon for loading dataset [nargs=0..1] [default: ""]
--index k/v pairs seperated by semicolon for creating index
--query k/v pairs seperated by semicolon for running the benchmarking queries [nargs=0..1] [default: ""]
--teardown k/v pairs seperated by semicolon for teardown options [nargs=0..1] [default: ""]
All parameters for the five phases must be specified as key=value
pairs, with semicolons used to separate each pair. When supplying multiple key=value
pairs, the entire parameter list should be enclosed in double quotes.
If you organize your dataset with the following structure and place it in the /opt/datasets
directory, you can omit the --path
option in all subsequent pgvectorbench commands.
➜ datasets tree -L 2
.
├── parquet
│ ├── cohere_medium_1m
│ ├── cohere_small_100k
│ └── openai_small_50k
└── vecs
├── crawl
├── gist
├── glove-100
├── sift
└── siftsmall
For instance, if you intend to execute a comprehensive test utilizing the siftsmall
dataset, you would proceed as follows:
./pgvectorbench -d postgres --path /home/zhjwpku/datasets/vecs/siftsmall --setup --load --index="index_type=hnsw;m=32;ef_construction=200" --query="loop=10;hnsw.ef_search=100;percentages=90,99,99.5,99.9"
As the previous command did not specify any teardown
options, you have the flexibility to schedule another query round, potentially with a different setting for hnsw.ef_search
:
./pgvectorbench -d postgres --path /home/zhjwpku/datasets/vecs/siftsmall --query="loop=10;hnsw.ef_search=200;percentages=90,99,99.5,99.9"
After benchmarking, you have the option to drop index individually during the teardown phase by executing the following command:
./pgvectorbench -d postgres --path /home/zhjwpku/datasets/vecs/siftsmall --teardown=drop_index=y
And the and potentially run another round of query Subsequently, you create another index and potentially followed by initiating another series of queries to further measure the database's performance:
./pgvectorbench -d postgres --path /home/zhjwpku/datasets/vecs/siftsmall --index="maintenance_work_mem=2GB;index_type=hnsw;m=64;ef_construction=200" --query="loop=10;hnsw.ef_search=200;percentages=90,99,99.5,99.9"
Prior to initiating the actual benchmarking process, one can prewarm the database by either omitting the loop
parameter or setting its value to 1:
./pgvectorbench -d postgres --path /home/zhjwpku/datasets/vecs/siftsmall --query="loop=1;hnsw.ef_search=200;percentages=90,99,99.5,99.9"
./pgvectorbench -d postgres --path /home/zhjwpku/datasets/vecs/siftsmall --query="loop=10;hnsw.ef_search=200;percentages=90,99,99.5,99.9"
As shown by previous examples, pgvectorbench, through the combination of its five phases, is capable of executing a diverse range of performance tests, which is why I consider pgvectorbench to be highly flexible.
There are additional parameters that can be configured for each phase, including but not limited to thread_num
, batch_size
, and table_name
. For an exhaustive list, I recommend referring to the source file.
If you are using docker, you should mount the host's datasets directory to the container's /opt/datasets
path and also specify the host for the PostgreSQL server.
docker run -it --mount type=bind,source=/home/zhjwpku/datasets,target=/opt/datasets pgvectorbench -- -h 192.168.31.32 -U zhjwpku -d postgres --query="loop=10;hnsw.ef_search=100;percentages=90,99,99.9"