Different vector storage for write-heavy workflows
asg017 opened this issue · 0 comments
With the current "faiss in a shadow table" storage strategy, there's one drawback: every time data in a vss0
table is inserted or updated and a COMMIT
happens, the entire Faiss index has to be written to the DB. So if your index as 1 million vectors and you insert 1 more, then 1,000,001 vectors will have to be written to the DB.
This is a limitation of Faiss's IO tools: I believe only entire writes are supported (unless you mmap which isn't possible with in-db storage).
Proposal: A non-faiss index storage option
We should created our own be-spoke index storage option that allows for incremental, streaming writes without re-writing the entire index. This would be outside of Faiss, but can still use some of Faiss's C++ API for querying.
For example, we could have:
create virtual table vss_demo using vss0(
index_storage="incremental",
chunk_embeddings(768)
);
insert into demo(rowid, chunk_embeddings)
select rowid, chunk_embeddings from demo;
-- inserting 1 row should be fast and not re-write the entire index
insert into demo(rowid, chunk_embeddings)
values (1234, :embedding);
The index_storage="incremental"
option is the key: that signals that instead of the default faiss-shadow-table index storage strategy, the vss_demo
table should instead store vectors in this new "incremental" storage format that's not Faiss-based.
We can still use Faiss for actual KNN and range searches, through knn_L2sqr_by_idx ()
and range_search_L2sqr()
. But we'll need to design our own SQLite-based vector storage solution.
Design notes
This is a WIP, will likely change.
create table vss_demo_config(
chunk_size integer
);
create table vss_demo_streaming_chunks(
-- blob of size `chunk_size * sizeof(int64)`, storing the rowid values of the corresponding vectors in `vectors`
rowids blob,
-- bitmap of size `chunk_size / 8` that tracks if the i-th vector is valid/not-deleted
valid blob,
-- blob of size `chunk_size * sizeof(float), containing the raw float vectors
vectors blob
);
At creation time, an all-zero chunk of size chunk_size
is inserted into vss_demo_streaming_chunks
. At insert time, each query and their rowid is stored in the rowids
and vectors
column using SQLite's BLOB I/O API, and the i-th bit in the valid
bitmap is updated. At query time, the vectors
and rowids
columns are read into memory, the range_search_L2sqr()
function is called to find the K-nearest vectors in that chunk, and the chunks are de-allocated and the next chunk is worked on. In the end, we re-sort the the results and get the true K-nearest vectors.
Possibly could be a new operation="optimize"
option that'll re-arrange these chunks for delete-heavy workflows, to save space.
insert into vss_demo(operation) values ('optimize');
Disadvantages
- This index will likely be slower to query, since each query will have to read from the disk. However, it'll be reading from the already-opened SQLite database and not external files, and small BLOBs in SQLite are extremely fast, so this may not be noticeable in most applications.
- This probably can only support flat indexes, ie it'll be an exhaustive search. Possibly could incorporate something like HNSW or IVF, but that'll be complex and further in the future.