Add support for `Chroma`
emrgnt-cmplxty opened this issue · 5 comments
We currently have an abstract base class, `VectorDatabaseProvider`, which defines the interfaces and behaviors expected of any vector database provider. This includes functions such as `save`, `load`, `add`, `update_database`, `clear`, `get_ordered_embeddings`, `contains`, `discard`, `get`, and `entry_to_key`.
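For orientation, here is a rough outline of what such an abstract base class could look like. Only the class name and the method names come from the description above; the generic parameters `T` and `K`, every signature, and every return type are assumptions made for illustration, not the project's actual definitions.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

T = TypeVar("T")  # entry type stored in the database (assumed)
K = TypeVar("K")  # key type used to look entries up (assumed)


class VectorDatabaseProvider(ABC, Generic[T, K]):
    """Hypothetical outline of the provider interface; signatures are guesses."""

    @abstractmethod
    def save(self) -> None: ...

    @abstractmethod
    def load(self) -> None: ...

    @abstractmethod
    def add(self, entry: T) -> None: ...

    @abstractmethod
    def update_database(self, entry: T) -> None: ...

    @abstractmethod
    def clear(self) -> None: ...

    @abstractmethod
    def get_ordered_embeddings(self) -> List[T]: ...

    @abstractmethod
    def contains(self, key: K) -> bool: ...

    @abstractmethod
    def discard(self, key: K) -> None: ...

    @abstractmethod
    def get(self, key: K) -> T: ...

    @abstractmethod
    def entry_to_key(self, entry: T) -> K: ...
```

Because every method is abstract, a provider that forgets one of them fails at instantiation time rather than at the first call, which is the main benefit of routing new backends through a base class like this.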
We have also implemented a concrete class, `JSONVectorDatabase`, that provides the actual implementation of these interfaces for JSON file-based storage.
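To make "JSON file-based storage" concrete, the following is a deliberately small sketch of the idea: keep entries in a dictionary and serialize it to a file on `save`. The class name `JSONVectorStore`, its attributes, and its method signatures are hypothetical and much simpler than the real `JSONVectorDatabase`, which is not shown in this issue.

```python
import json
import os
import tempfile
from typing import Dict, List


class JSONVectorStore:
    """Hypothetical, simplified stand-in for a JSON file-backed vector store."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path
        self.data: Dict[str, List[float]] = {}

    def add(self, key: str, embedding: List[float]) -> None:
        self.data[key] = embedding

    def contains(self, key: str) -> bool:
        return key in self.data

    def save(self) -> None:
        # Write the whole dictionary out in one go.
        with open(self.file_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)

    def load(self) -> None:
        # Replace the in-memory state with the file contents.
        with open(self.file_path, "r", encoding="utf-8") as f:
            self.data = json.load(f)


# Round trip: save from one instance, load into a fresh one.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.json")
    store = JSONVectorStore(path)
    store.add("doc-1", [0.1, 0.2, 0.3])
    store.save()

    reloaded = JSONVectorStore(path)
    reloaded.load()
    round_trip_ok = reloaded.contains("doc-1")
```

With this kind of backend, `save` and `load` do real work, which is worth keeping in mind when comparing it with a backend that manages its own storage.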
Now, we want to add support for a new provider called `Chroma`.
To do this, we need a new concrete class that extends the `VectorDatabaseProvider` abstract base class and provides the implementation specific to the Chroma vector database.
Below is an example of how you can implement this. The implementation is simplified and might need adjustments to fit our use case.
Please note that I have assumed `T` to be a tuple of a document and its metadata and `K` to be the id of the document.
    from typing import Any, Dict, List, Tuple

    import chromadb

    # VectorDatabaseProvider is the project's abstract base class;
    # import it from wherever it is defined.


    class ChromaVectorDatabase(VectorDatabaseProvider[Tuple[str, Dict[str, Any]], str]):
        """Concrete class to provide a vector database that uses Chroma."""

        def __init__(self, collection_name: str):
            self.collection_name = collection_name
            self.client = chromadb.Client()
            self.collection = self.client.get_or_create_collection(collection_name)

        def save(self):
            # Nothing to do here. Whether data persists depends on how the client
            # is configured; the default chromadb.Client() may keep data in memory only.
            pass

        def load(self):
            # In Chroma, the data loading happens automatically when creating or getting the collection.
            pass

        def add(self, entry: Tuple[str, Dict[str, Any]]):
            document, metadata = entry
            self.collection.add(
                documents=[document],
                metadatas=[metadata],
                ids=[self.entry_to_key(entry)],
            )

        def update_database(self, entry: Tuple[str, Dict[str, Any]]):
            # Simplest approach: delete and re-add.
            # Chroma collections also offer update() and upsert().
            self.discard(self.entry_to_key(entry))
            self.add(entry)

        def discard(self, key: str):
            self.collection.delete(ids=[key])

        def contains(self, key: str) -> bool:
            # get() does not raise for unknown ids; it returns empty result lists.
            return len(self.collection.get(ids=[key])["ids"]) > 0

        def get(self, key: str) -> Tuple[str, Dict[str, Any]]:
            # get() returns a dict of parallel lists, so unpack the single match.
            result = self.collection.get(ids=[key])
            if not result["ids"]:
                raise KeyError(key)
            return result["documents"][0], result["metadatas"][0]

        def clear(self):
            # Collections have no delete_all(); drop and recreate the collection instead.
            self.client.delete_collection(self.collection_name)
            self.collection = self.client.get_or_create_collection(self.collection_name)

        def get_ordered_embeddings(self) -> List[Tuple[str, Dict[str, Any]]]:
            # Chroma might not support retrieving all entries ordered by their similarity to a given vector.
            # You will need to adjust this method based on your specific use case and Chroma's capabilities.
            raise NotImplementedError

        def entry_to_key(self, entry: Tuple[str, Dict[str, Any]]) -> str:
            # The key is assumed to be the first sentence of the document
            return entry[0].split(".")[0]
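One detail of the example worth checking before it goes in is the key derivation in `entry_to_key`. Taking the text before the first period means two documents that open with the same sentence map to the same key. The sketch below shows the collision and contrasts it with a content hash; `content_hash_key` and the sample entries are hypothetical, offered as one possible alternative rather than a decision for this project.

```python
import hashlib
from typing import Any, Dict, Tuple

Entry = Tuple[str, Dict[str, Any]]  # (document, metadata), as assumed in this issue


def first_sentence_key(entry: Entry) -> str:
    """Key used in the example: the text before the first period."""
    return entry[0].split(".")[0]


def content_hash_key(entry: Entry) -> str:
    """Hypothetical alternative: SHA-256 hex digest of the full document text."""
    return hashlib.sha256(entry[0].encode("utf-8")).hexdigest()


# Two made-up documents that share an opening sentence.
a: Entry = ("Setup guide. Install the package with pip.", {"source": "a.md"})
b: Entry = ("Setup guide. Build the package from source.", {"source": "b.md"})

same_first_sentence = first_sentence_key(a) == first_sentence_key(b)
same_hash = content_hash_key(a) == content_hash_key(b)
```

A hash-based key avoids the collision but changes whenever the document text changes, so an update would no longer overwrite the old entry. Which trade-off is right depends on how callers identify documents.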
I'd recommend reading the documentation or source code of Chroma to find out how to implement the exact behaviors required by our abstract base class. For instance, the `get_ordered_embeddings` method might not be possible to implement with Chroma's current API. Similarly, it is unclear from the provided information whether Chroma supports updating entries or only adding and deleting them. To fully implement Chroma, we will need to work through all of these details and more.
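While those details are being confirmed, the adapter logic can be exercised without a running Chroma instance by testing it against a small in-memory stand-in. The sketch below does that for `contains`, `get`, and a delete-and-re-add update. `FakeCollection` and the three helper functions are hypothetical. The stand-in's `get` returns a dict of parallel lists and omits unknown ids, which is an assumption about the shape of Chroma's results and should be verified against the Chroma documentation.

```python
from typing import Any, Dict, List, Tuple

Entry = Tuple[str, Dict[str, Any]]  # (document, metadata)


class FakeCollection:
    """Hypothetical in-memory stand-in for a collection; not part of Chroma."""

    def __init__(self) -> None:
        self._rows: Dict[str, Entry] = {}

    def add(self, documents: List[str], metadatas: List[Dict[str, Any]], ids: List[str]) -> None:
        for doc, meta, id_ in zip(documents, metadatas, ids):
            self._rows[id_] = (doc, meta)

    def delete(self, ids: List[str]) -> None:
        for id_ in ids:
            self._rows.pop(id_, None)  # unknown ids are ignored

    def get(self, ids: List[str]) -> Dict[str, List[Any]]:
        # Assumed result shape: parallel lists, with unknown ids left out.
        found = [i for i in ids if i in self._rows]
        return {
            "ids": found,
            "documents": [self._rows[i][0] for i in found],
            "metadatas": [self._rows[i][1] for i in found],
        }


def contains_key(collection: Any, key: str) -> bool:
    """True if the collection returns at least one match for the key."""
    return len(collection.get(ids=[key])["ids"]) > 0


def get_entry(collection: Any, key: str) -> Entry:
    """Unpack the single match, or raise KeyError if there is none."""
    result = collection.get(ids=[key])
    if not result["ids"]:
        raise KeyError(key)
    return result["documents"][0], result["metadatas"][0]


def replace_entry(collection: Any, key: str, entry: Entry) -> None:
    """Update by deleting any existing entry and adding the new one."""
    collection.delete(ids=[key])
    document, metadata = entry
    collection.add(documents=[document], metadatas=[metadata], ids=[key])


collection = FakeCollection()
replace_entry(collection, "doc-1", ("First draft.", {"rev": 1}))
replace_entry(collection, "doc-1", ("Second draft.", {"rev": 2}))
latest = get_entry(collection, "doc-1")
```

The helpers only rely on `add`, `delete`, and `get`, so the same functions could be pointed at a real collection once its behavior has been checked.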
Great idea! We'll be using a local Chroma instance, I presume?
For now, yes. It should be easy enough to switch it out for a cloud provider.
First pass here - 0aec1dd
Further refined here - #148
merging in now.