GhostDB is a Python package that provides a fast and efficient way to store and search for embeddings using HNSW (Hierarchical Navigable Small World) indexing from the hnswlib library. It is designed for applications that require fast nearest neighbor search, such as natural language processing, image recognition, and recommendation systems.
- Fast approximate nearest neighbor search using HNSW indexing
- Support for high-dimensional data
- Persistence of index and text data on disk
- Easy-to-use API for adding, searching, and managing embeddings
- Support for batch operations
pip install pyghostdb
Here is an example of how to use GhostDB:
import numpy as np
from pyghostdb.ghost_storage import GhostStorage
# Initialize GhostStorage with default settings
ghost_storage = GhostStorage()
# Add an embedding to the storage
text_id = 1
text = "Sample text"
embedding = np.random.rand(1536)
ghost_storage.add(text_id, text, embedding)
# Search for the nearest neighbor of a query embedding
query_embedding = np.random.rand(1536)
result = ghost_storage.search(query_embedding, k=1)
# Get the text and embedding of the nearest neighbor
nearest_text, nearest_embedding = result[0]
print(f"Nearest text: {nearest_text}, Nearest embedding: {nearest_embedding}")
# Persist the index and text storage to disk
ghost_storage.persist()
# Load the index and text storage from disk
ghost_storage.load()
The main class for storing and searching embeddings.
Initialize the GhostStorage instance.
dim
: The dimension of the embeddings (default: 1536)max_elements
: The maximum number of elements that can be stored in the index (default: 10^5)persist_dir
: The directory where the index and text storage will be persisted (default: "ghost_dir")
Add a single text and its embedding to the storage.
text_id
: The unique identifier of the texttext
: The text associated with the embeddingembedding
: The embedding as a numpy array or list
Add multiple texts and their embeddings to the storage.
ids
: A list of unique identifiers for the textstexts
: A list of texts associated with the embeddingsembeddings
: A numpy array containing the embeddings
Search for the k nearest neighbors of a query embedding.
embedding
: The query embedding as a numpy array or listk
: The number of nearest neighbors to search for (default: 1)
Returns a list of tuples containing the text and its embedding for each nearest neighbor.
Clear the storage and remove the index and text storage from disk.
Persist the index and text storage to disk.
Load the index and text storage from disk.