support selection of similarity metrics for diversity and relevancy in MMR
Background
- The literature seems unclear on which similarity metrics perform best for diversity and relevancy. (If anyone has found a good analysis of this, it would be great to see.)
- BM25 works better when a lot of text preprocessing is performed (stemming/lemmatization, word normalization, stop-word removal, etc.), which is not as common in genAI/embedding workflows. User data may be better suited to a vector-search similarity function than to a keyword-based method.
Suggestion
We have already implemented BM25 and cosine similarity. Allow users to select which similarity method they want to use for each role (with reasonable defaults).
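To make the suggestion concrete, here is a minimal sketch of MMR with a separately selectable metric for relevancy and for diversity. It is not LangStream's implementation: the `Doc` type, the `SIMILARITIES` registry, the `mmr_select` signature, and the parameter names are all hypothetical, and `jaccard_similarity` is only a simple keyword-overlap stand-in for a lexical metric such as BM25.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Doc:
    text: str
    embedding: Sequence[float]


def cosine_similarity(a: Doc, b: Doc) -> float:
    """Cosine similarity between the two embeddings (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a.embedding, b.embedding))
    norm_a = math.sqrt(sum(x * x for x in a.embedding))
    norm_b = math.sqrt(sum(y * y for y in b.embedding))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Doc, b: Doc) -> float:
    """Token-overlap similarity; a lexical stand-in, NOT BM25."""
    tokens_a = set(a.text.lower().split())
    tokens_b = set(b.text.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical registry mapping a configuration value to a metric.
SIMILARITIES: Dict[str, Callable[[Doc, Doc], float]] = {
    "cosine": cosine_similarity,
    "jaccard": jaccard_similarity,
}


def mmr_select(
    query: Doc,
    docs: List[Doc],
    k: int,
    lambda_: float = 0.5,
    relevance: str = "cosine",
    diversity: str = "cosine",
) -> List[Doc]:
    """Greedy MMR: trade off relevance to the query against redundancy
    with the documents already selected, each with its own metric."""
    rel_fn = SIMILARITIES[relevance]
    div_fn = SIMILARITIES[diversity]
    candidates = list(docs)
    selected: List[Doc] = []

    def score(doc: Doc) -> float:
        # Redundancy is the highest similarity to any already-selected doc.
        redundancy = max((div_fn(doc, s) for s in selected), default=0.0)
        return lambda_ * rel_fn(query, doc) - (1.0 - lambda_) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

A caller would then choose the metrics by name, for example `mmr_select(query, docs, k=2, relevance="cosine", diversity="jaccard")`, with the defaults covering users who do not want to configure anything.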
I agree that we must make it configurable to choose which metrics to use.
But I disagree that we should let users use a vector-search similarity function (like cosine similarity) for ensuring "diversity" in BM25. The documents have already been retrieved from the vector database as the closest ones according to that same function, so using it again won't help reduce redundancy in the set of documents sent to the LLM in the prompt.
One of the main benefits of LangStream, thanks to its asynchronous nature, is that it makes it easy to perform preprocessing before storing the text in the vector database (we already have a few agents that help with a good configuration out of the box).
@acantarero do you have any proposals for other metrics to use?