
pgvecto.rs


pgvecto.rs is a Postgres extension that provides vector similarity search functions. It is written in Rust and based on pgrx. It is currently in beta; we invite you to try it out in production and provide us with feedback. Read more at 📝our launch blog.

Why use pgvecto.rs

  • 💃 Easy to use: pgvecto.rs is a Postgres extension, which means that you can use it directly within your existing database. This makes it easy to integrate into your existing workflows and applications.
  • 🥅 Filtering: pgvecto.rs supports filtering. You can set conditions when searching or retrieving points. This feature is missing from other Postgres extensions.
  • 🚀 High Performance: pgvecto.rs is designed to provide significant improvements compared to existing Postgres extensions. Benchmarks have shown that its HNSW index can deliver search performance up to 20 times faster than other indexes like ivfflat.
  • 🔧 Extensible: pgvecto.rs is designed to be extensible. It is easy to add new index structures and search algorithms. This flexibility ensures that pgvecto.rs can adapt to emerging vector search algorithms and meet diverse performance needs.
  • 🦀 Rewritten in Rust: Rust's strict compile-time checks ensure memory safety, reducing the risk of bugs and security issues commonly associated with C extensions.
  • 🙋 Community Driven: We encourage community involvement and contributions, fostering innovation and continuous improvement.

Comparison with pgvector

Feature | pgvecto.rs | pgvector
Transaction support | ✅ | ⚠️
Sufficient Result with Delete/Update/Filter | ✅ | ⚠️
Vector Dimension Limit | 65535 | 2000
Prefilter on HNSW | ✅ | ❌
Parallel Index build | ⚡️ Linearly faster with more cores | 🐌 Only single core used
Index Persistence | mmap file | Postgres internal storage
WAL amplification | 2x 😃 | 30x 🧐

And based on our benchmark, pgvecto.rs can be up to 2x faster than pgvector on HNSW indexes with the same configuration. Read more about the comparison here.

Installation

We recommend trying pgvecto.rs with our pre-built Docker image by running

docker run --name pgvecto-rs-demo -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d tensorchord/pgvecto-rs:latest
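Once the container is up, you can connect with any Postgres client, for example psql (a minimal sketch assuming the default postgres superuser and the password set above):

PGPASSWORD=mysecretpassword psql -h localhost -p 5432 -U postgres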

For other installation methods (binary install or building from source), see docs/install.md

Get started

Run the following SQL to ensure the extension is enabled

DROP EXTENSION IF EXISTS vectors;
CREATE EXTENSION vectors;

pgvecto.rs allows columns of a table to be defined as vectors.

The data type vector(n) denotes an n-dimensional vector. The n within the brackets signifies the dimensions of the vector. For instance, vector(1000) would represent a vector with 1000 dimensions, so you could create a table like this.

-- create table with a vector column

CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3) NOT NULL
);

You can then populate the table with vector data as follows.

-- insert values

INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]');

-- or insert values using a casting from array to vector

INSERT INTO items (embedding)
VALUES (ARRAY[1, 2, 3]::real[]), (ARRAY[4, 5, 6]::real[]);

We support three operators to calculate the distance between two vectors.

  • <->: squared Euclidean distance, defined as $\Sigma (x_i - y_i) ^ 2$.
  • <#>: negative dot product distance, defined as $- \Sigma x_iy_i$.
  • <=>: negative cosine distance, defined as $- \frac{\Sigma x_iy_i}{\sqrt{\Sigma x_i^2 \Sigma y_i^2}}$.
-- call the distance function through operators

-- squared Euclidean distance
SELECT '[1, 2, 3]'::vector <-> '[3, 2, 1]'::vector;
-- negative dot product distance
SELECT '[1, 2, 3]' <#> '[3, 2, 1]';
-- negative cosine distance
SELECT '[1, 2, 3]' <=> '[3, 2, 1]';

You can search for a vector simply like this.

-- query the similar embeddings
SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 5;
-- query the neighbors within a certain distance
SELECT * FROM items WHERE embedding <-> '[3,2,1]' < 5;

Indexing

You can create an index using squared Euclidean distance with the following SQL.

-- Using HNSW algorithm.

CREATE INDEX ON items USING vectors (embedding l2_ops)
WITH (options = "capacity = 2097152");

--- Or using bruteforce with PQ.

CREATE INDEX ON items USING vectors (embedding l2_ops)
WITH (options = $$
capacity = 2097152
[vectors]
memmap = "disk"
[algorithm.flat]
quantization = { product = { ratio = "x16" } }
$$);

--- Or using IVFPQ algorithm.

CREATE INDEX ON items USING vectors (embedding l2_ops)
WITH (options = $$
capacity = 2097152
[vectors]
memmap = "disk"
[algorithm.ivf]
quantization = { product = { ratio = "x16" } }
$$);

--- Or using Vamana algorithm.

CREATE INDEX ON items USING vectors (embedding l2_ops)
WITH (options = $$
capacity = 2097152
[algorithm.vamana]
$$);

Now you can simply perform a KNN search with the following SQL.

SELECT *, embedding <-> '[0, 0, 0]' AS score
FROM items
ORDER BY embedding <-> '[0, 0, 0]' LIMIT 10;

Please note that vector indexes are not loaded by default when PostgreSQL restarts. To load or unload an index, you can use vectors_load and vectors_unload.

--- get the index name
\d items

-- load the index
SELECT vectors_load('items_embedding_idx'::regclass);
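
-- unload the index
SELECT vectors_unload('items_embedding_idx'::regclass);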

We are planning to support more index types (see the issue here).

Welcome to contribute if you are also interested!

Why not a specialized vector database?

Read our blog at modelz.ai/blog/pgvector

Reference

vector type

vector and vector(n) are both legal data types, where n denotes the dimensions of a vector.

The current implementation ignores dimensions of a vector, i.e., the behavior is the same as for vectors of unspecified dimensions.

There is only one exception: indexes cannot be created on columns without specified dimensions.
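For example, a hedged sketch of both rules (the table name t is hypothetical):

-- legal: a column with unspecified dimensions
CREATE TABLE t (val vector NOT NULL);
-- dimensions are not enforced on insert
INSERT INTO t (val) VALUES ('[1, 2, 3]'), ('[1, 2]');
-- expected to fail: indexes cannot be created on columns without specified dimensions
CREATE INDEX ON t USING vectors (val l2_ops)
WITH (options = "capacity = 2097152");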

Indexing

We utilize TOML syntax to express the index's configuration. Here's what each key in the configuration signifies:

Key | Type | Description
capacity | integer | The index's capacity. The value should be greater than the number of rows in your table.
vectors | table | Configuration of background process vector storage.
algorithm | table | The algorithm to be used for indexing.

Options for table algorithm.

Key | Type | Description
flat | table | If this table is set, the brute force algorithm will be used for the index.
ivf | table | If this table is set, IVF will be used for the index.
hnsw | table | If this table is set, HNSW will be used for the index.
vamana | table | If this table is set, Vamana will be used for the index.

You can choose only one of the above algorithms. The default is hnsw.

Options for table vectors.

Key | Type | Description
memmap | string | "ram" keeps vectors always cached in RAM, while "disk" suggests otherwise. Default value is "ram".

Options for table flat.

Key | Type | Description
quantization | table | The algorithm to be used for quantization.

Options for table ivf.

Key | Type | Description
memmap | string | "ram" keeps algorithm storage always cached in RAM, while "disk" suggests otherwise. Default value is "ram".
build_threads | integer | Number of threads used to build the index. Default value is the number of hardware threads.
max_threads | integer | Maximum number of threads that can be used for searching the index. Default value is twice the number of hardware threads.
nlist | integer | Number of cluster units. Default value is 1000.
nprobe | integer | Number of units to query. Default value is 10.
least_iterations | integer | Minimum iterations for K-Means clustering. Default value is 16.
iterations | integer | Maximum iterations for K-Means clustering. Default value is 500.
quantization | table | The quantization algorithm to be used.
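
For instance, an IVF index that sets the clustering parameters explicitly could be declared as follows (a sketch following the syntax of the examples in the Indexing section; the nlist and nprobe values are illustrative):

CREATE INDEX ON items USING vectors (embedding l2_ops)
WITH (options = $$
capacity = 2097152
[algorithm.ivf]
nlist = 1000
nprobe = 10
$$);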

Options for table hnsw.

Key | Type | Description
memmap | string | "ram" keeps algorithm storage always cached in RAM, while "disk" suggests otherwise. Default value is "ram".
build_threads | integer | Number of threads used to build the index. Default value is the number of hardware threads.
m | integer | Maximum degree of a node. Default value is 36.
ef_construction | integer | Search scope during index building. Default value is 500.
quantization | table | The quantization algorithm to be used.
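
For example, an HNSW index with a larger graph degree and construction search scope could be declared as follows (a sketch following the syntax of the examples in the Indexing section; the m and ef_construction values are illustrative):

CREATE INDEX ON items USING vectors (embedding l2_ops)
WITH (options = $$
capacity = 2097152
[algorithm.hnsw]
m = 64
ef_construction = 800
$$);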

Options for table vamana.

Key | Type | Description
memmap | string | "ram" keeps algorithm storage always cached in RAM, while "disk" suggests otherwise. Default value is "ram".
build_threads | integer | Number of threads used to build the index. Default value is the number of hardware threads.
r | integer | Maximum degree of a node. Default value is 50.
l | integer | Search scope during building. Default value is 70.
alpha | float | Slack factor in building. Default value is 1.2.

Options for table quantization.

Key | Type | Description
trivial | table | If this table is set, no quantization is used.
scalar | table | If this table is set, scalar quantization is used.
product | table | If this table is set, product quantization is used.

You can choose only one of the above quantization algorithms. The default is trivial.

Options for table scalar.

Key | Type | Description
memmap | string | "ram" keeps quantized vectors always cached in RAM, while "disk" suggests otherwise. Default value is "ram".

The compression ratio for scalar quantization is always "x4": if the size of the vectors is 1024 MB, then the size of the quantized vectors is 256 MB.
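For example, a flat index with scalar quantization might be declared as follows (a sketch assuming the same inline-table TOML shape as the product-quantization examples above; the empty table selects scalar quantization with its defaults):

CREATE INDEX ON items USING vectors (embedding l2_ops)
WITH (options = $$
capacity = 2097152
[algorithm.flat]
quantization = { scalar = { } }
$$);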

Options for table product.

Key | Type | Description
memmap | string | "ram" keeps quantized vectors always cached in RAM, while "disk" suggests otherwise. Default value is "ram".
sample | integer | Samples to be used for quantization. Default value is 65535.
ratio | string | Compression ratio for quantization. Only "x4", "x8", "x16", "x32", "x64" are allowed. Default value is "x4".

And you can change the expected number of results (such as ef_search in hnsw) with the following SQL.

-- (Optional) Expected number of candidates returned by the index
SET vectors.k = 32;
-- Or use SET LOCAL to set the value only for the current session
SET LOCAL vectors.k = 32;

If you want to disable vector indexing or prefilter, we also offer some GUC options:

  • vectors.enable_vector_index: Enable or disable the vector index. Default value is on.
  • vectors.enable_prefilter: Enable or disable the prefilter. Default value is on.
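
For example, to fall back to a plain sequential scan in the current session, you could turn these options off with standard SET statements (a sketch; the GUC names are the ones listed above):

-- disable the vector index for the current session
SET vectors.enable_vector_index = off;
-- disable prefiltering for the current session
SET vectors.enable_prefilter = off;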

Limitations

  • The index is constructed and persisted using a memory map file (mmap) instead of PostgreSQL's shared buffer. As a result, physical replication or logical replication may not function correctly. Additionally, vector indexes are not automatically loaded when PostgreSQL restarts. To load or unload the index, you can utilize the vectors_load and vectors_unload commands.
  • The filtering process is not yet optimized. To achieve optimal performance, you may need to experiment with different strategies. For example, you can try searching without a vector index, or use post-filtering as shown in the query after this list: perform an approximate nearest neighbor (ANN) search to obtain enough results and then apply the filter afterwards.
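
A post-filtering query might look like this (note that the subquery needs an alias; the category column is only illustrative):

-- fetch enough approximate nearest neighbors first, then filter
SELECT * FROM (
  SELECT * FROM items ORDER BY embedding <-> '[3,2,1]' LIMIT 100
) AS candidates
WHERE category = 1;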

Setting up the development environment

You could use envd to set up the development environment with one command. It will create a docker container and install all the dependencies for you.

pip install envd
envd up

Contributing

We need your help! Please check out the issues.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Alex Chi 💻
AuruTus 💻
Avery 💻 🤔
Ben Ye 📖
Ce Gao 💼 🖋 📖
Jinjing Zhou 🎨 🤔 📆
Keming 🐛 💻 📖 🤔 🚇
Mingzhuo Yin 💻 ⚠️ 🚇
Usamoi 💻 🤔
odysa 📖 💻
yihong 💻
盐粒 Yanli 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Acknowledgements

Thanks to the following projects:

  • pgrx - Postgres extension framework in Rust
  • pgvector - Postgres extension for vector similarity search written in C