/Context-based-document-search

A Python-based tool for context-based search across text documents using OpenAI embeddings and Chroma vector storage. This system enables efficient querying of document collections by generating vector embeddings, storing them persistently, and retrieving relevant results based on textual queries.

Primary LanguagePython

Contextual Documents Search

This project provides a system for performing context-based search across documents stored in a vector database. Using OpenAI's embedding models and Chroma, this tool allows you to efficiently search through a collection of text documents and retrieve the most relevant results based on a given query.

Features

  • Automatic vector embedding generation for documents stored in a specified directory.
  • Easy-to-use search functionality that finds the most contextually relevant documents.
  • Persistent vector storage using Chroma, allowing for seamless loading and updating of the database.

Prerequisites

  • Python 3.7 or higher

  • OpenAI API key

  • Install the required packages by running:

    pip install -r requirements.txt

    Installation

  1. Clone the repository:
    git clone https://github.com/your-username/contextual-documents-search.git
  2. Navigate to the project directory:
    cd contextual-documents-search
  3. Set up a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate   # On Windows: venv\Scripts\activate
  4. Install dependencies:
    pip install -r requirements.txt
  5. Set up your environment variables. Create a .env file in the project root and add your OpenAI API key:
    OPENAI_API_KEY = your_openai_api_key

Usage

Initializing and Querying the Vector Database

  1. Prepare a directory of .txt files you want to search through and place them in the ./resumes folder or specify a different directory in the code.

  2. In your main script, instantiate the VectorDBHandler class and call load_or_create_db() to initialize the vector store.

    from dotenv import load_dotenv
    from vector_db_handler import VectorDBHandler
    
    # Load environment variables
    load_dotenv()
    
    # Set up directory paths and collection name
    files_directory = "./resumes"
    persist_directory = "./vector_db"
    collection_name = "resumes_collection"
    
    # Initialize the vector database handler
    vector_db_handler = VectorDBHandler(files_directory, persist_directory, collection_name)
    
    # Load or create the vector store database
    vector_db_handler.load_or_create_db()
    
    # Define the query for the search
    query = "I am looking for a software engineer with OpenAI hard skill."
    docs = vector_db_handler.query_vector_store(query)
    
    # Output the top result
    if docs:
        print("Top matching document:")
        print(docs[0].page_content)
    else:
        print("No matching documents found.")