Vexpresso

Vexpresso is a simple and scalable multi-modal vector database built with Daft

(Demo: querying Pokemon with images and text)

Features

🍵 Simple: Vexpresso is lightweight and very easy to get started with!

🔌 Flexible: Unlike many other vector databases, Vexpresso supports arbitrary datatypes. This means that you can query multi-modal objects (images, audio, video, etc.).

🌐 Scalable: Because Vexpresso uses Daft, it can be scaled with Ray across multi-GPU/CPU clusters.

📚 Persistent: Vexpresso provides simple functions for saving to and loading from Hugging Face datasets.

Installation

To install from PyPI:

pip install vexpresso

To install from source:

git clone git@github.com:shyamsn97/vexpresso.git
cd vexpresso
pip install -e .

Usage

🔥 Check out our Showcase notebook for a more detailed walkthrough!

In this example, we create a collection and embed its documents using Hugging Face sentence-transformers.

from typing import List, Any
import vexpresso
# built-in embedding functions also ship with vexpresso (not used below)
import vexpresso.embedding_functions as ef

# creating a collection object!
collection = vexpresso.create(
    data = {
        "documents":[
            "This is document1",
            "This is document2",
            "This is document3",
            "This is document4",
            "This is document5",
            "This is document6"
        ],
        "source":["notion", "google-docs", "google-docs", "notion", "google-docs", "google-docs"],
        "num_lines":[10, 20, 30, 40, 50, 60]
    },
    # backend="ray" # turn this flag on to start / connect to a Ray cluster!
)

# create a simple embedding function from sentence_transformers
def hf_embed_fn(content: List[Any]):
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(content, convert_to_tensor=True).detach().cpu().numpy()

# or use a langchain embedding function
def langchain_embed_fn(content: List[Any]):
    from langchain.embeddings import OpenAIEmbeddings
    embeddings_model = OpenAIEmbeddings()
    return embeddings_model.embed_documents(content)
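
# since collections hold arbitrary datatypes, the same pattern works for images.
# a minimal sketch using a sentence-transformers CLIP model (illustrative --
# the model name is just an example, and `images` would hold PIL images):
def clip_image_embed_fn(images: List[Any]):
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("clip-ViT-B-32")
    # sentence-transformers CLIP models accept PIL images directly
    return model.encode(images, convert_to_tensor=True).detach().cpu().numpy()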

# embed function creates a column in the collection with embeddings. There can be more than one embedding column!
# lazy execution until .execute is called
collection = collection.embed(
    "documents",
    embedding_fn=hf_embed_fn,
    to="document_embeddings",
    # lazy=False # if False, .execute() doesn't need to be called
).execute()
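
# a collection can hold more than one embedding column -- just call embed again
# with a different source column and target name (a sketch reusing hf_embed_fn):
collection = collection.embed(
    "source",
    embedding_fn=hf_embed_fn,
    to="source_embeddings",
).execute()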

# create a queried collection containing the subset of content closest to the query
queried_collection = collection.query(
    "document_embeddings",
    query="query document6",
    k = 4, # return the 4 closest
    lazy=False,
    # query_embedding=[query1, query2, ...]
    # filter_conditions={"metadata_field":{"operator, ex: 'eq'":"value"}} # optional metadata filter
)
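
# you can also pass precomputed vectors through query_embedding (hinted above);
# this sketch assumes it accepts the vectors produced by the embedding function
embedding_queried_collection = collection.query(
    "document_embeddings",
    query_embedding=hf_embed_fn(["query document6"]),
    k=4,
    lazy=False,
)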

# batch query -- return a list of collections
# batch_queried_collection = collection.batch_query(
#     "document_embeddings",
#     queries=["doc1", "doc2"],
#     k = 2
# )

# filter collection for documents with num_lines less than or equal to 30
filtered_collection = queried_collection.filter(
    {
        "num_lines": {"lte":30}
    }
).execute()

# show dataframe
filtered_collection.show()

# convert to dictionary
filtered_dict = filtered_collection.to_dict()
documents = filtered_dict["documents"]

# add new entries!
collection = collection.add(
    [
        {"documents":"new document 1", "source":"notion", "num_lines":2},
        {"documents":"new document 2", "source":"google-docs", "num_lines":40}
    ]
)
collection = collection.execute()
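
# to persist results (the Persistent feature above), one route is a round trip
# through a Hugging Face dataset built from the collection's dictionary form
# (illustrative -- vexpresso also ships native save/load helpers, see the docs)
from datasets import Dataset

hf_dataset = Dataset.from_dict(collection.to_dict())
hf_dataset.save_to_disk("my_collection")
restored = Dataset.load_from_disk("my_collection").to_dict()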

Contributing

Feel free to open a pull request or file an issue for a feature request!