Future-House/paper-qa

Setting NumpyVectorStore as an embedding model.

Snikch63200 opened this issue · 2 comments

Hello,

I read in the docs that NumpyVectorStore can be used as an embedding model:

> Therefore, NumpyVectorStore is a good place to start, it's a simple in-memory store, without an index. However, if a larger-than-memory vector store is needed, we are currently lacking here.

(https://github.com/Future-House/paper-qa?tab=readme-ov-file#changing-embedding-model)

I have two questions about it:

  • What are the fundamental behavioural differences between NumpyVectorStore, LiteLLMEmbeddingModel, and SparseEmbeddingModel?
  • How do I correctly set up a NumpyVectorStore object?

Best regards. @dosu.

The fundamental behavioral differences between NumpyVectorStore, LiteLLMEmbeddingModel, and SparseEmbeddingModel are not explicitly detailed in the documentation. However, here are some insights:

  1. NumpyVectorStore: This is a simple in-memory vector store without indexing, making it suitable for smaller datasets. It is used to store and manage embeddings of texts within the Docs collection [1].

  2. LiteLLMEmbeddingModel: This model defaults to the "text-embedding-3-small" configuration and allows customization through parameters like rate_limit and additional arguments for the litellm.aembedding function [2].

  3. SparseEmbeddingModel: This model defaults to 256 dimensions, adjustable via the ndim argument. It uses the encoding obtained from tiktoken.get_encoding("cl100k_base") [2]. Instantiating both models is sketched just after this list.
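
For concreteness, here is a minimal sketch of instantiating the two embedding models with the parameters mentioned above. The keyword names `name` and `ndim` are assumptions based on paper-qa's current exports; check them against your installed version:

```python
from paperqa import LiteLLMEmbeddingModel, SparseEmbeddingModel

# Dense embeddings via litellm; "text-embedding-3-small" is the default model.
dense = LiteLLMEmbeddingModel(name="text-embedding-3-small")

# Sparse token-count embeddings; ndim (default 256) sets the dimensionality,
# with tokens drawn from tiktoken's "cl100k_base" encoding.
sparse = SparseEmbeddingModel(ndim=256)
```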

To set up a NumpyVectorStore correctly, you typically instantiate it and specify the embedding model when adding documents to the Docs object. NumpyVectorStore is the default store for indexing texts in the Docs class, so when documents are added, their embeddings are computed and managed by the NumpyVectorStore [1].
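
A minimal sketch of that setup, assuming the async `Docs.aadd`/`Docs.aquery` API and that the texts index is exposed as a `texts_index` field (names worth verifying against your installed paper-qa version):

```python
import asyncio

from paperqa import Docs, LiteLLMEmbeddingModel, NumpyVectorStore


async def main() -> None:
    # NumpyVectorStore is already the default texts index on Docs,
    # so passing it explicitly is optional; shown here for clarity.
    docs = Docs(texts_index=NumpyVectorStore())

    # The embedding model is supplied when adding a document: paper-qa
    # embeds the text chunks and stores the vectors in the NumpyVectorStore.
    await docs.aadd(
        "my_paper.pdf",  # hypothetical local file
        embedding_model=LiteLLMEmbeddingModel(name="text-embedding-3-small"),
    )

    answer = await docs.aquery("What methods does the paper propose?")
    print(answer)


asyncio.run(main())
```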


Hi @Snikch63200

The vector store stores embedded texts and can retrieve them using similarity.

The embedding model is the model that converts a text into a vector.
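
To make the split concrete, here is a toy sketch in plain NumPy (not the paper-qa API): random vectors stand in for what an embedding model would produce, and the store's job is exact similarity ranking over them, which is what NumpyVectorStore does:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for embeddings: in paper-qa these would come from an
# embedding model such as LiteLLMEmbeddingModel (text -> vector).
chunk_vectors = rng.normal(size=(5, 8))  # 5 stored text chunks
query_vector = rng.normal(size=8)        # the embedded query

# The vector store's role: hold the vectors and rank them by similarity.
# NumpyVectorStore does this exactly (brute force), with no approximate index.
scores = (chunk_vectors @ query_vector) / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top3 = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar chunks
print(top3, scores[top3])
```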

We previously supported other vector stores like ChromaDB, FAISS, etc. Those approximate exact vector retrieval (which is what NumpyVectorStore performs) at much better performance for extremely large sets of vectors. We haven't empirically seen issues though, even with hundreds of thousands of vectors, so we dropped support for those externally managed solutions. Now the only store is NumpyVectorStore.