Question: Score on multiple vectors
coreation opened this issue · 9 comments
Hi @lior-k ,
First of all, many thanks for maintaining this plugin. I haven't tried it yet, but it's something we were about to code ourselves.
I just had a question with regard to comparing documents to multiple vectors and how that would work, presumably through a fork or even something you might want to merge into the code base.
Going by the code, I think the main thing to change is in this function: the value that is fetched would be treated as an array of vectors, on which the calculations are then performed, instead of just the single vector that is used now... Is that correct, and if so, is that something you might want to merge if/when we make the code change in a fork?
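To make it concrete, here's a rough sketch of what I mean (hypothetical names, not the plugin's actual code), assuming the vectors are already L2-normalized so a dot product gives the cosine similarity:

```java
// Rough sketch with hypothetical names, not the plugin's actual code:
// score one query vector against every vector stored on a document and
// combine the per-vector cosine similarities. Assumes all vectors are
// L2-normalized, so a dot product equals the cosine similarity.
static double scoreMultiVector(double[] query, double[][] docVectors, String combine) {
    double sum = 0.0;
    double best = Double.NEGATIVE_INFINITY;
    for (double[] vector : docVectors) {
        double dot = 0.0;
        for (int i = 0; i < query.length; i++) {
            dot += query[i] * vector[i];
        }
        sum += dot;
        best = Math.max(best, dot);
    }
    // "max" rewards the single best-matching vector; otherwise average.
    return "max".equals(combine) ? best : sum / docVectors.length;
}
```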
Keep up the good work!
Hi, thanks for your interest. I'm glad the plugin can be of use to you, and that might be a good addition, as long as it will not affect latency or throughput.
Your idea sounds interesting. I'm especially interested in your use case, can you share it?
How many vectors per document do you think you'd need?
Hi @lior-k, good point on the latency and throughput, I have no idea yet what the effect will be.
Our use case is probably limited to a relatively low number of vectors (ranging from 3 to 20ish), which I can imagine is quite a big deal in terms of compute... so in retrospect I'm cautious to call it a low amount ^^.
The use case we want to build is to return documents that match several pieces of text. Those pieces of text are sometimes already embedded, but they can be search queries as well. The embedding isn't an issue, we have our models ready, but scaling the similarity search is what we want to use Elastic for.
For example, return all documents that match a general research question, but also take into account "starred" documents. In a Twitter example you would search for tweets matching a certain piece of text, but also take into account tweets the user has already marked as "favorite". In our use case documents are mostly larger pieces of text like papers, patents and news content.
Does that explain the use case for you?
Edit: Look at it like the "MoreLikeThis" functionality in Elasticsearch, where you can pass one or more document IDs or pieces of text the results should match closely. In our case, however, the combined score (min, max, avg - not sure which combining operator would be best) would be the preferred score to rank results on.
Hi @lior-k, indeed ES works on a single-score basis, but as you rightly figured out, the score would be a simple average, weighted avg, ... whatever works for us. The main reason we're not concatenating all of the pieces of text a search must be similar to is that those pieces can vary (heavily) in size, which could skew the resulting vector and return fewer or worse results. It's something we need to try out and see :)
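For illustration, a weighted combination of the per-piece scores could look something like this minimal sketch (the names are mine, nothing from the plugin):

```java
// Minimal sketch: combine per-piece similarity scores with weights.
// scores[i] is the document's similarity to piece of text i; weights[i]
// is how much that piece should count. Arrays must have equal length.
static double weightedScore(double[] scores, double[] weights) {
    double weightedSum = 0.0;
    double weightTotal = 0.0;
    for (int i = 0; i < scores.length; i++) {
        weightedSum += scores[i] * weights[i];
        weightTotal += weights[i];
    }
    return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
}
```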
On a different note, I've tried the code and it seems to be working fine. We do have documents in our index that have no embedding; am I correct to say that will mess up the results? From first tries it looks like it does. I've seen in other issues and the README that the field must be there, but does every document need an embedding value, or can it be empty for this to work?
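Just to be clear about the behaviour I'd expect, here's a hypothetical sketch (again, not the plugin's code) where documents without an embedding score neutrally instead of distorting the ranking:

```java
// Hypothetical sketch of the behaviour I'd expect, not the plugin's code:
// a document without a stored embedding gets a neutral score instead of
// corrupting the ranking.
static double safeScore(double[] query, double[] docVector) {
    if (docVector == null || docVector.length != query.length) {
        return 0.0; // no (usable) embedding stored: score neutrally
    }
    double dot = 0.0;
    for (int i = 0; i < query.length; i++) {
        dot += query[i] * docVector[i];
    }
    return dot; // cosine similarity, assuming normalized vectors
}
```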
Ok, that makes sense. I'll make a fork and see where I end up. If I can do it with a parameter so that it can be part of this code base, I won't hesitate to do so. Thanks so much for being so responsive!
@lior-k the new developments in Elasticsearch 7.x, are they implementing the same thing? I see in their blog posts that searching (scoring) through similarity is still an ongoing thing that they're looking to expand, but is this repository essentially doing the same thing that Elastic is doing in 7.x with their dense_vector field type and the ability to script the score through cosineSimilarity, for example?
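For reference, the 7.x route I'm referring to looks roughly like this with the high-level REST client; the field name `embedding` and the query vector are placeholders for our own mapping:

```java
import java.util.List;
import java.util.Map;

import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.ScriptScoreQueryBuilder;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;

// Roughly the 7.x approach: a dense_vector field scored with a Painless
// script. "embedding" and the query vector are placeholder values.
ScriptScoreQueryBuilder query = QueryBuilders.scriptScoreQuery(
        QueryBuilders.matchAllQuery(),
        new Script(ScriptType.INLINE, "painless",
                // +1.0 keeps the score non-negative, as script_score requires
                "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                Map.of("query_vector", List.of(0.1, 0.2, 0.3))));
```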
Thanks for all the help so far @lior-k. Based on what this repository is meant for (1-to-1 vector comparison), we decided to just start from scratch. We needed four different comparable parts, each with its own weight, to make things work for our use case; in my opinion that's far from what this repository is aiming to do.
So I'll close this issue with a big thanks for your open source work!