Setting NumpyVectorStore as embedding model.
Snikch63200 opened this issue · 2 comments
Hello,
I read in the docs that `NumpyVectorStore` can be used as an embedding model:
> Therefore, NumpyVectorStore is a good place to start, it's a simple in-memory store, without an index. However, if a larger-than-memory vector store is needed, we are currently lacking here.
(https://github.com/Future-House/paper-qa?tab=readme-ov-file#changing-embedding-model)
I have two questions about it:
- What are the fundamental behavioral differences between `NumpyVectorStore`, `LiteLLMEmbeddingModel`, and `SparseEmbeddingModel`?
- How do I correctly set up a `NumpyVectorStore` object?
Best regards. @dosu.
The fundamental behavioral differences between `NumpyVectorStore`, `LiteLLMEmbeddingModel`, and `SparseEmbeddingModel` are not explicitly detailed in the documentation. However, here are some insights:

- `NumpyVectorStore`: This is a simple in-memory vector store without indexing, making it suitable for smaller datasets. It is used to store and manage embeddings of texts within the `Docs` collection [1].
- `LiteLLMEmbeddingModel`: This model defaults to the `"text-embedding-3-small"` configuration and allows customization through parameters like `rate_limit` and additional arguments for the `litellm.aembedding` function [2].
- `SparseEmbeddingModel`: This model defaults to 256 dimensions, which can be adjusted using the `ndim` argument. It uses the encoding obtained from `tiktoken.get_encoding("cl100k_base")` [2].

To correctly set up a `NumpyVectorStore` object, you typically instantiate it and specify the embedding model when adding documents to the `Docs` object. The `NumpyVectorStore` is used as the default field for indexing texts in the `Docs` class. When adding documents, the embeddings are calculated and managed by the `NumpyVectorStore` [1].
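To make the `SparseEmbeddingModel` bullet above concrete, here is a minimal, dependency-free sketch of a sparse, fixed-dimension embedding. It is illustrative only: the real model tokenizes with `tiktoken.get_encoding("cl100k_base")`, while this sketch approximates tokenization with a whitespace split and buckets tokens with a CRC32 hash; the function name `sparse_embed` is hypothetical, not part of paper-qa's API.

```python
import zlib

def sparse_embed(text: str, ndim: int = 256) -> list[float]:
    """Illustrative sparse embedding: hash each token into one of
    `ndim` buckets and count occurrences (the real SparseEmbeddingModel
    uses tiktoken's cl100k_base encoding instead of this hash)."""
    vec = [0.0] * ndim
    for token in text.lower().split():
        # CRC32 gives a deterministic bucket index, unlike Python's hash()
        vec[zlib.crc32(token.encode()) % ndim] += 1.0
    # L1-normalize so texts of different lengths stay comparable
    total = sum(vec) or 1.0
    return [v / total for v in vec]
```

Because the vector length is fixed by `ndim`, every text maps into the same 256-dimensional space regardless of how many distinct tokens it contains.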
Hi @Snikch63200
The vector store stores embedded texts and retrieves them by similarity.
The embedding model is the model that converts a text into a vector.
We previously supported other vector stores like ChromaDB, FAISS, etc. These approximate the exact vector retrieval that `NumpyVectorStore` does, but at much better performance for extremely large sets of vectors. We haven't empirically seen issues, though, even with hundreds of thousands of vectors, and so dropped support for these externally managed solutions. Now the only store is the `NumpyVectorStore`.
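The exact (brute-force) retrieval described above can be sketched in a few lines of NumPy: store every embedding in one dense matrix and compute cosine similarity against all rows at query time. The class and method names below (`TinyVectorStore`, `add`, `similarity_search`) are illustrative, not paper-qa's actual API.

```python
import numpy as np

class TinyVectorStore:
    """Sketch of an exact, in-memory vector store without an index,
    in the spirit of NumpyVectorStore (names are illustrative)."""

    def __init__(self) -> None:
        self._texts: list[str] = []
        self._matrix: np.ndarray | None = None  # shape (n_texts, dim)

    def add(self, text: str, embedding: list[float]) -> None:
        vec = np.asarray(embedding, dtype=float)[None, :]
        self._texts.append(text)
        self._matrix = vec if self._matrix is None else np.vstack([self._matrix, vec])

    def similarity_search(self, query_embedding: list[float], k: int = 4) -> list[str]:
        q = np.asarray(query_embedding, dtype=float)
        # Cosine similarity against every stored vector: exact, O(n * dim),
        # which is why no approximate index (FAISS, ChromaDB, ...) is needed
        # until the collection gets extremely large.
        sims = self._matrix @ q / (
            np.linalg.norm(self._matrix, axis=1) * np.linalg.norm(q) + 1e-12
        )
        top = np.argsort(sims)[::-1][:k]
        return [self._texts[i] for i in top]
```

Usage: after `store.add("some text", embedding)` for each chunk, `store.similarity_search(query_vec, k=5)` returns the five most similar texts. Because every comparison is computed, results are exact rather than approximate.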