Different vector storage for write-heavy workflows

Question

Different vector storage for write-heavy workflows

asg017 opened this issue a year ago · 0 comments

With the current "faiss in a shadow table" storage strategy, there's one drawback: every time data in a vss0 table is inserted or updated and a COMMIT happens, the entire Faiss index has to be written to the DB. So if your index as 1 million vectors and you insert 1 more, then 1,000,001 vectors will have to be written to the DB.

This is a limitation of Faiss's IO tools: I believe only entire writes are supported (unless you mmap which isn't possible with in-db storage).

Proposal: A non-faiss index storage option

We should created our own be-spoke index storage option that allows for incremental, streaming writes without re-writing the entire index. This would be outside of Faiss, but can still use some of Faiss's C++ API for querying.

For example, we could have:

create virtual table vss_demo using vss0(
  index_storage="incremental",
  chunk_embeddings(768)
);

insert into demo(rowid, chunk_embeddings) 
  select rowid, chunk_embeddings from demo;


-- inserting 1 row should be fast and not re-write the entire index
insert into demo(rowid, chunk_embeddings)
  values (1234, :embedding);

The index_storage="incremental" option is the key: that signals that instead of the default faiss-shadow-table index storage strategy, the vss_demo table should instead store vectors in this new "incremental" storage format that's not Faiss-based.

We can still use Faiss for actual KNN and range searches, through knn_L2sqr_by_idx () and range_search_L2sqr(). But we'll need to design our own SQLite-based vector storage solution.

Design notes

This is a WIP, will likely change.

create table vss_demo_config(
  chunk_size integer
);
create table vss_demo_streaming_chunks(
  -- blob of size `chunk_size * sizeof(int64)`, storing the rowid values of the corresponding vectors in `vectors`
  rowids blob,
  -- bitmap of size `chunk_size / 8` that tracks if the i-th vector is valid/not-deleted
  valid blob,
  -- blob of size `chunk_size * sizeof(float), containing the raw float vectors
  vectors blob
);

At creation time, an all-zero chunk of size chunk_size is inserted into vss_demo_streaming_chunks. At insert time, each query and their rowid is stored in the rowids and vectors column using SQLite's BLOB I/O API, and the i-th bit in the valid bitmap is updated. At query time, the vectors and rowids columns are read into memory, the range_search_L2sqr() function is called to find the K-nearest vectors in that chunk, and the chunks are de-allocated and the next chunk is worked on. In the end, we re-sort the the results and get the true K-nearest vectors.

Possibly could be a new operation="optimize" option that'll re-arrange these chunks for delete-heavy workflows, to save space.

insert into vss_demo(operation) values ('optimize');

Disadvantages

This index will likely be slower to query, since each query will have to read from the disk. However, it'll be reading from the already-opened SQLite database and not external files, and small BLOBs in SQLite are extremely fast, so this may not be noticeable in most applications.
This probably can only support flat indexes, ie it'll be an exhaustive search. Possibly could incorporate something like HNSW or IVF, but that'll be complex and further in the future.