Embeddings storage
The first approach was to create a separate command that computed the embeddings and stored them in the database using pgvector. But to implement Semantic Search we have to index the embeddings in Solr anyway, so to simplify things we dropped pgvector and the embeddings table entirely and just compute the embedding every time a dataset is indexed (in the `before_dataset_index` plugin hook).
This has the benefit of not having to worry about embeddings going stale when, for instance, a dataset title is updated. But it is very likely not performant enough, especially when calling an external API, because the hooks are called on each individual dataset, so we can't submit data in bulk.
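For reference, a minimal sketch of this index-time approach (the plugin class name, the Solr field name `vector`, and the `embed_text()` helper are all hypothetical, not actual code from the extension):

```python
import json

import ckan.plugins as plugins

from ckanext.myext.embeddings import embed_text  # hypothetical helper


class SemanticSearchPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IPackageController, inherit=True)

    def before_dataset_index(self, dataset_dict):
        # Called once per dataset during indexing, so an external
        # embedding API gets one request per dataset -- the
        # bottleneck described above.
        text = dataset_dict.get("title", "") + " " + (dataset_dict.get("notes") or "")
        dataset_dict["vector"] = json.dumps(embed_text(text))
        return dataset_dict
```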
A probably better option would be to cache the embeddings in the database, creating them beforehand in the `after_dataset_create` and `after_dataset_update` hooks. We might not even need pgvector; we could just store them as arrays of floats or even strings, which is what we actually send to Solr.
An additional CLI command to refresh all (or some) embeddings would still be useful, e.g. when changing models.
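Such a command could be wired in through CKAN's `IClick` interface. A sketch, where `embed_texts()` (a bulk variant of `embed_text()`), `store_embedding()` (the upsert from the previous sketch), and the `--model` option are all hypothetical:

```python
import click

import ckan.plugins as plugins
import ckan.plugins.toolkit as tk

# Hypothetical helpers: embed_texts() embeds a batch of strings,
# store_embedding() upserts into the dataset_embedding table.
from ckanext.myext.embeddings import embed_texts, store_embedding


@click.command("refresh-embeddings")
@click.option("--model", "model_name", default=None,
              help="Embedding model to use (hypothetical option).")
@click.argument("dataset_ids", nargs=-1)
def refresh_embeddings(model_name, dataset_ids):
    """Recompute cached embeddings for the given datasets, or all of them."""
    context = {"ignore_auth": True}
    ids = list(dataset_ids) or tk.get_action("package_list")(context, {})
    batch_size = 100
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        datasets = [
            tk.get_action("package_show")(context, {"id": id_}) for id_ in batch
        ]
        # One bulk request per batch instead of one request per dataset.
        vectors = embed_texts(
            [d.get("title", "") + " " + (d.get("notes") or "") for d in datasets],
            model=model_name,
        )
        for dataset, vector in zip(datasets, vectors):
            store_embedding(dataset["id"], vector)
        click.echo(f"Refreshed {start + len(batch)}/{len(ids)} embeddings")


class SemanticSearchPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IClick)

    def get_commands(self):
        return [refresh_embeddings]
```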