biigle/maia

Use inner product for similarity sort

Closed this issue · 2 comments

mzur commented

If we normalize the feature vectors, the inner product distance should produce the same ordering as the cosine distance. This is recommended by pgvector. Normalize all new (and old) feature vectors and use the inner product instead of the cosine distance, because it is faster.
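
For illustration, a minimal sketch of what the normalization and inner product ordering could look like with Laravel and pgvector (the `maia_annotation_candidates` table and `vector` column are placeholders, not the actual schema):

```php
<?php
// A sketch only; the table and column names below are placeholders.

use Illuminate\Support\Facades\DB;

// Scale a feature vector to unit length. For unit vectors, sorting by the
// inner product produces the same order as sorting by the cosine distance.
function normalize(array $vector): array
{
    $norm = sqrt(array_sum(array_map(fn ($x) => $x * $x, $vector)));

    return $norm > 0 ? array_map(fn ($x) => $x / $norm, $vector) : $vector;
}

// Placeholder reference vector in pgvector's text format, e.g. "[0.37,...]".
$reference = '[' . implode(',', normalize([0.3, 0.1, 0.7])) . ']';

// pgvector's "<#>" operator returns the *negative* inner product, so an
// ascending sort returns the most similar candidates first.
$ids = DB::table('maia_annotation_candidates')
    ->orderByRaw('vector <#> ?', [$reference])
    ->pluck('id');
```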

mzur commented

So I experimented with the vector database a bit and found out the following (@dlangenk):

The different similarity measures are about equally fast! The cosine similarity even seems to be the fastest by a tiny margin. More importantly, if I query all ~4 million annotation candidate rows, the query takes only about 4 seconds. If I query the ~20k candidates of a single MAIA job, it takes only 20 ms! So I don't see a need to optimize this here.
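
In case anyone wants to reproduce the comparison, this is roughly how it can be timed (the table/column names and the vector dimension are assumptions):

```php
<?php
// A rough timing sketch; table/column names and the dimension are assumptions.

use Illuminate\Support\Facades\DB;

// Arbitrary query vector in pgvector's text format.
$query = '[' . implode(',', array_fill(0, 384, 0.5)) . ']';

// <-> is the L2 distance, <#> the (negative) inner product and <=> the
// cosine distance. Each query returns the full ordered ID list, like the
// ~4 second query described above.
foreach (['<->' => 'L2', '<#>' => 'inner product', '<=>' => 'cosine'] as $op => $name) {
    $start = microtime(true);
    DB::select("SELECT id FROM maia_annotation_candidates ORDER BY vector {$op} ?", [$query]);
    printf("%s: %.0f ms\n", $name, (microtime(true) - $start) * 1000);
}
```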

For LabelBOT we may still need an index, but we might not need to switch to the inner product. Interesting fact: data transfer and display also play a role here. If I query all 4 million rows and return all ordered IDs, the query takes 4 seconds. If I limit the results to the 10 most similar, the query takes only 1.6 seconds! I hope that we can get this really fast even without an index, since we can probably limit the search space a lot (e.g. to only the relevant label trees). If the index returns accurate enough results, it should be really fast even on all annotations. But it's crucial that users get an immediate response here, too.
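
If an approximate index does become necessary for LabelBOT, it could look roughly like this (a sketch; the index, table and column names are placeholders and not part of the current schema):

```php
<?php
// A sketch only; the index/table/column names are placeholders.

use Illuminate\Support\Facades\DB;

// HNSW index for the cosine distance; use vector_ip_ops instead if we ever
// switch to the inner product. With "ORDER BY ... LIMIT n" this turns the
// query into an approximate nearest neighbor scan.
DB::statement(
    'CREATE INDEX annotation_candidates_vector_index
     ON maia_annotation_candidates
     USING hnsw (vector vector_cosine_ops)'
);

// Limiting the result set helps on its own (4 s vs. 1.6 s above), because
// much less data has to be transferred to the application.
$topTen = DB::select(
    'SELECT id FROM maia_annotation_candidates ORDER BY vector <=> ? LIMIT 10',
    ['[0.3,0.1,0.7]'] // placeholder query vector
);
```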

I was wondering why the MAIA sorting takes so long when the query itself only takes 20 ms. It turns out that the PHP pgvector package produces a slow query. I'll optimize the query and update the code.

I'll close this for now.

mzur commented

Update: With the optimized query, the sorting request in MAIA takes only 500 ms instead of 1000 ms.