emrgnt-cmplxty/automata

Add support for `Chroma`

Closed this issue · 5 comments

We currently have an abstract base class, VectorDatabaseProvider, which defines the interface and behavior expected of any vector database provider. This includes methods such as save, load, add, update_database, clear, get_ordered_embeddings, contains, discard, get and entry_to_key.

We have also implemented a concrete class, JSONVectorDatabase, that implements this interface for JSON file-based storage.
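As a rough outline of what that contract looks like, here is a minimal sketch. The method names come from the list above; the signatures, the `VectorDatabaseProviderSketch` name, and the toy `InMemoryProvider` are assumptions for illustration and are not the project's actual code.

```python
from abc import ABC, abstractmethod
from typing import Dict, Generic, List, TypeVar

T = TypeVar("T")  # entry type stored in the database
K = TypeVar("K")  # key type used to look entries up


class VectorDatabaseProviderSketch(ABC, Generic[T, K]):
    """Hypothetical outline of the provider interface (signatures are assumed)."""

    @abstractmethod
    def save(self) -> None: ...

    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def add(self, entry: T) -> None: ...

    @abstractmethod
    def update_database(self, entry: T) -> None: ...

    @abstractmethod
    def discard(self, key: K) -> None: ...

    @abstractmethod
    def contains(self, key: K) -> bool: ...

    @abstractmethod
    def get(self, key: K) -> T: ...

    @abstractmethod
    def clear(self) -> None: ...

    @abstractmethod
    def get_ordered_embeddings(self) -> List[T]: ...

    @abstractmethod
    def entry_to_key(self, entry: T) -> K: ...


class InMemoryProvider(VectorDatabaseProviderSketch[str, str]):
    """Toy dict-backed provider showing what a concrete subclass must supply."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self) -> None:
        pass  # nothing to persist in this toy example

    def load(self) -> None:
        pass

    def add(self, entry: str) -> None:
        self._data[self.entry_to_key(entry)] = entry

    def update_database(self, entry: str) -> None:
        self.add(entry)  # overwriting the key is enough for a dict

    def discard(self, key: str) -> None:
        self._data.pop(key, None)

    def contains(self, key: str) -> bool:
        return key in self._data

    def get(self, key: str) -> str:
        return self._data[key]

    def clear(self) -> None:
        self._data.clear()

    def get_ordered_embeddings(self) -> List[str]:
        # Assumption: "ordered" means sorted by key
        return [self._data[k] for k in sorted(self._data)]

    def entry_to_key(self, entry: str) -> str:
        return entry  # the entry is its own key in this toy example
```

A new provider only has to fill in these ten methods; anything that depends on the abstract type should then work unchanged.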

Now, we want to add support for a new provider called Chroma.

To do this, we need a new concrete class that extends the VectorDatabaseProvider abstract base class and provides the implementation specific to the Chroma vector database.

Below is an example of how you can implement this. The implementation is simplified and might need adjustments to fit our use case.

Please note that I have assumed T to be a tuple of a document and its metadata and K to be the id of the document.

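from typing import Any, Dict, List, Tuple

import chromadb

# VectorDatabaseProvider is our existing abstract base class (import omitted here).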

class ChromaVectorDatabase(VectorDatabaseProvider[Tuple[str, Dict[str, Any]], str]):
    """Concrete class to provide a vector database that uses Chroma."""

    def __init__(self, collection_name: str):
        self.client = chromadb.Client()
        self.collection = self.client.get_or_create_collection(collection_name)

    def save(self):
        # Nothing to do here: persistence depends on how the client is configured.
        # Note that a default chromadb.Client() keeps data in memory only.
        pass

    def load(self):
        # In Chroma, the data loading happens automatically when creating or getting the collection.
        pass

    def add(self, entry: Tuple[str, Dict[str, Any]]):
        document, metadata = entry
        self.collection.add(
            documents=[document], 
            metadatas=[metadata], 
            ids=[self.entry_to_key(entry)]
        )

    def update_database(self, entry: Tuple[str, Dict[str, Any]]):
        # Delete and re-add. Chroma collections also document update/upsert methods,
        # which may be a better fit; check the installed version.
        self.discard(self.entry_to_key(entry))
        self.add(entry)

    def discard(self, key: str):
        # Assuming Chroma has a delete method
        self.collection.delete(ids=[key])

    def contains(self, key: str) -> bool:
        # collection.get returns an empty result (rather than raising) for an unknown id,
        # so check whether any ids came back.
        return len(self.collection.get(ids=[key])["ids"]) > 0

    def get(self, key: str) -> Tuple[str, Dict[str, Any]]:
        # collection.get returns a dict of parallel lists; unpack the single match
        result = self.collection.get(ids=[key])
        if not result["ids"]:
            raise KeyError(key)
        return result["documents"][0], result["metadatas"][0]

    def clear(self):
        # Collections have no delete_all method; fetch every id and delete those
        ids = self.collection.get()["ids"]
        if ids:
            self.collection.delete(ids=ids)

    def get_ordered_embeddings(self) -> List[Tuple[str, Dict[str, Any]]]:
        # Chroma might not support retrieving all entries ordered by their similarity to a given vector.
        # You will need to adjust this method based on your specific use case and Chroma's capabilities.
        raise NotImplementedError("get_ordered_embeddings is not yet defined for Chroma")

    def entry_to_key(self, entry: Tuple[str, Dict[str, Any]]) -> str:
        # The key is assumed to be the first sentence of the document
        return entry[0].split(".")[0]
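One thing to watch with the first-sentence key: two documents that open with the same sentence map to the same id. The sketch below shows the collision and one possible alternative, a hash of the full document text. The `content_hash_key` helper is hypothetical and not part of the proposal above.

```python
import hashlib


def first_sentence_key(document: str) -> str:
    """Key scheme from the example above: everything before the first period."""
    return document.split(".")[0]


def content_hash_key(document: str) -> str:
    """Hypothetical alternative: a SHA-256 hex digest of the whole document."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


doc_a = "Install the package. Then run the tests."
doc_b = "Install the package. Then read the docs."

# Both documents share a first sentence, so the first scheme gives one key for both
same_key = first_sentence_key(doc_a) == first_sentence_key(doc_b)

# The hash scheme keeps them apart
distinct_hashes = content_hash_key(doc_a) != content_hash_key(doc_b)
```

The trade-off is that a content hash changes whenever the document text changes, so an update would have to discard the old key before adding the new entry.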

I'd recommend reading the Chroma documentation or source code to find out how to implement the exact behavior required by the abstract base class. For instance, get_ordered_embeddings might not be possible to implement with Chroma's current API. Similarly, it is unclear from the information available whether Chroma supports updating entries or only adding and deleting them. To fully support Chroma, we will need to work through all of these details and more.
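If "ordered" can simply mean sorted by id rather than by similarity, one possible approach is to fetch everything and sort on the client side. This is a sketch under two assumptions: that interpretation of the ordering, and that `collection.get()` with no arguments returns a dict of parallel lists under `ids`, `documents` and `metadatas`. `entries_ordered_by_id` and `FakeCollection` are hypothetical names; the fake only mimics that result shape so the example runs without Chroma installed.

```python
from typing import Any, Dict, List, Tuple

Entry = Tuple[str, Dict[str, Any]]


def entries_ordered_by_id(collection: Any) -> List[Entry]:
    """Return every (document, metadata) pair in the collection, sorted by id."""
    result = collection.get()
    rows = zip(result["ids"], result["documents"], result["metadatas"])
    # Sort on the id only, so metadata dicts are never compared
    ordered = sorted(rows, key=lambda row: row[0])
    return [(doc, meta or {}) for _, doc, meta in ordered]


class FakeCollection:
    """Stand-in with the assumed get() result shape, for demonstration only."""

    def __init__(self, rows: Dict[str, Entry]) -> None:
        self._rows = rows

    def get(self) -> Dict[str, list]:
        ids = list(self._rows)
        return {
            "ids": ids,
            "documents": [self._rows[i][0] for i in ids],
            "metadatas": [self._rows[i][1] for i in ids],
        }


fake = FakeCollection({"b": ("second doc", {"n": 2}), "a": ("first doc", {"n": 1})})
ordered_entries = entries_ordered_by_id(fake)
```

Pulling every entry into memory will not scale to large collections, so this is only a starting point for small ones.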

Great idea! We'll be using a local Chroma instance, I presume?

For now, yes. It should be easy enough to switch over to a cloud provider later.

First pass here - 0aec1dd

Further refined here - #148

merging in now.