Semantic Search Engine with Vectorized Databases

This repository contains the code and documentation for a semantic search engine built on a vectorized database. The project focuses on building an efficient indexing system that retrieves the top-k vectors most similar to a given query vector, and it includes an evaluation of that indexing system's performance. Retrieval operates on vector space embeddings, demonstrating how a vectorized database can be implemented and used in a practical application.

Overview
Get Started
Modules
Contributors
License

Project Overview

The key components of the project include:

VecDB: A class representing the vectorized database, responsible for storing and retrieving vectors.
insert_records(): A method to insert vectors into the database.
retrieve(): A method to retrieve the top-k most similar vectors for a given query vector.
_cal_score(): A helper method to calculate the cosine similarity between two vectors.
_build_index(): A method responsible for building the index.
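
The scoring and retrieval steps listed above can be sketched without any index at all. The following is a minimal brute-force baseline, not the repository's implementation: the function names `cosine_similarity` and `brute_force_top_k` are hypothetical, and the record shape (`"id"` and `"embed"` keys) is borrowed from the usage example in the Modules section. Returning 0.0 for zero-norm vectors is an assumption made here to avoid division by zero.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(records, query, top_k=5):
    """Score every record against the query and return the ids of the best top_k.

    `records` is a list of {"id": ..., "embed": [...]} dictionaries.
    """
    # Pair each record id with its similarity to the query
    scored = ((cosine_similarity(r["embed"], query), r["id"]) for r in records)
    # Keep only the top_k highest-scoring pairs, best first
    best = heapq.nlargest(top_k, scored, key=lambda pair: pair[0])
    return [record_id for _, record_id in best]
```

A full scan like this costs time proportional to the number of stored vectors for every query, which is the cost an index is meant to reduce. It is still useful as the exact reference that an indexed search can be compared against.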

Get Started

To get started with the project, follow these steps:

Clone the repository to your local machine.
Run the provided code; see the notebook for further clarification.
Customize the code and add any additional features as needed.
Run the evaluation to assess the accuracy of your implementation.
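
This section does not say how accuracy is measured. One common choice for approximate top-k search is recall: the fraction of the exact top-k results (for example, from a brute-force scan) that the indexed search also returns. The sketch below assumes that metric; `recall_at_k` and `mean_recall` are hypothetical names and are not part of the repository.

```python
def recall_at_k(retrieved_ids, exact_ids):
    """Fraction of the exact top-k ids that also appear in the retrieved ids."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(retrieved_ids)) / len(exact)


def mean_recall(results):
    """Average recall over (retrieved_ids, exact_ids) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, exact) for retrieved, exact in results]
    return sum(scores) / len(scores) if scores else 0.0
```

For each test query, `retrieved_ids` would come from the indexed `retrieve()` call and `exact_ids` from an exhaustive search over the same data. A mean recall close to 1.0 indicates that the index rarely misses true neighbors.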

Modules

The project provides a VecDB class that you can use to interact with the vectorized database. Here's an example of how to use it:

from VecDB import VecDB

# Create an instance of VecDB
db = VecDB()

# Insert records into the database
records = [
    {
        "id": 1,
        "embed": [0.1, 0.2, 0.3, ...]  # Image vector of dimension 70
    },
    {
        "id": 2,
        "embed": [0.4, 0.5, 0.6, ...]
    },
    ...
]
db.insert_records(records)

# Retrieve similar images for a given query
query_vector = [0.7, 0.8, 0.9, ...]  # Query vector of dimension 70
similar_images = db.retrieve(query_vector, top_k=5)
print(similar_images)

The project also provides a BinaryFile class that you can use to read and write binary files. Here's an example of how to use it:

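import random

import numpy as np

# BinaryFile is provided by this project; import it from the module that defines it

# setup data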
num_rows = 1000000
vec_size = 70
# define instance of class
file_path = "data.bin"
# empty file if exists
open(file_path, 'w').close()
bfh = BinaryFile(file_path)
# create data and write to binary file
records_np = np.random.random((num_rows, vec_size))
records_dict = [{"id": i, "embed": list(row)} for i, row in enumerate(records_np)]
bfh.insert_records(records_dict)
# read and verify a single record
random_row_id = random.randint(0, num_rows - 1)
vec_ran = bfh.read_row(random_row_id)[1:]
vec_real = records_np[random_row_id]
print('Single record verification:', np.allclose(vec_ran, vec_real))
# read all records and verify
retrieved_all = bfh.read_all()
retrieved_all = retrieved_all[:,1:]  # remove id from retrieved all
print('All records verification:', np.allclose(retrieved_all, records_np))
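
Reading a single row by its position, as `read_row` does above, is cheap when every record occupies the same number of bytes, because the byte offset is simply the row index times the record size. The sketch below illustrates that idea with the standard `struct` module. The layout (a little-endian 64-bit integer id followed by 64-bit floats) and the names `RECORD`, `append_records`, and `read_row` are assumptions for illustration and may differ from how `BinaryFile` actually stores data.

```python
import os
import struct
import tempfile

VEC_SIZE = 4  # kept small for illustration; the example above uses 70
# Assumed layout: one little-endian int64 id followed by VEC_SIZE float64 values
RECORD = struct.Struct(f"<q{VEC_SIZE}d")


def append_records(path, records):
    """Append {"id", "embed"} records to the file as fixed-width rows."""
    with open(path, "ab") as f:
        for r in records:
            f.write(RECORD.pack(r["id"], *r["embed"]))


def read_row(path, row_index):
    """Seek directly to a row by position and return (id, embed)."""
    with open(path, "rb") as f:
        f.seek(row_index * RECORD.size)  # fixed width makes the offset a multiplication
        fields = RECORD.unpack(f.read(RECORD.size))
    return fields[0], list(fields[1:])


# Write ten rows to a temporary file, then read one back by position
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    rows = [{"id": i, "embed": [i + 0.5] * VEC_SIZE} for i in range(10)]
    append_records(path, rows)
    row_id, row_vec = read_row(path, 7)
    file_size = os.path.getsize(path)
```

Because only the requested bytes are read, a lookup does not require loading the whole file into memory, which matters at the scale of the one-million-row example above.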