HumanAIze Hackathon: AI-powered GenAI Chatbot for Legal Assistance

This repository contains an AI-powered chatbot designed to assist individuals with limited knowledge of legal matters. The chatbot provides information on laws and penal codes, along with real-life case examples, using advanced AI techniques including Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE).

Task Description

The goal is to create an AI-based question-answering system that can provide accurate and contextually relevant answers based on legal documents and real-life cases. The system should:

  • Retrieve relevant legal passages based on the user's questions.
  • Generate precise and contextually appropriate answers using the retrieved information.
  • Enhance the quality of retrieved passages and generated answers using HyDE to improve embedding representations.

Proposed Solution: RAG with Hypothetical Document Embeddings (HyDE)

Retrieval with HyDE

  • Hypothetical Document Generation: A large language model (LLM) generates a "hypothetical document" that captures the essence of the user's query.
  • Embedding and Retrieval: The hypothetical document is encoded into a vector representation and used to search the document embedding space, retrieving the passages most similar to it (see the sketch below).
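
To make the retrieval step concrete, here is a minimal sketch using LangChain's HypotheticalDocumentEmbedder; the llm, base_embeddings, and vector_store objects are placeholders assumed to be defined elsewhere, not the exact names used in this repository.

    # Minimal HyDE retrieval sketch (illustrative names, not the repository's exact code).
    from langchain.chains import HypotheticalDocumentEmbedder

    # llm: any LangChain LLM; base_embeddings: any LangChain embedding model
    hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
        llm, base_embeddings, "web_search"  # prompt key used to write the hypothetical document
    )

    # The LLM first writes a hypothetical answer document for the query; that document,
    # not the raw query, is embedded and used to search the vector store.
    query_vector = hyde_embeddings.embed_query("What is the punishment for theft?")
    relevant_docs = vector_store.similarity_search_by_vector(query_vector, k=4)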

Answer Generation

  • Using Retrieved Passages: The retrieved passages are fed into the Generator component of RAG, another LLM.
  • Formulating Answers: The LLM analyzes the retrieved passages together with the user's original question to formulate a precise and contextually relevant answer (a sketch of this step follows).
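
A minimal sketch of this generation step, reusing the relevant_docs and llm placeholders from the retrieval sketch above; the prompt wording is illustrative.

    # Stuff the retrieved passages into a prompt and ask the generator LLM (illustrative).
    from langchain_core.prompts import ChatPromptTemplate

    answer_prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )

    context = "\n\n".join(doc.page_content for doc in relevant_docs)
    answer = (answer_prompt | llm).invoke(
        {"context": context, "question": "What is the punishment for theft?"}
    )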

Enhancing Performance with HyDE

  • Traditional vs. HyDE Retrieval: Traditional retrieval methods might struggle to capture the full context of the question. HyDE's hypothetical documents provide a more nuanced representation of the needed information.

Setup Instructions

Prerequisites

Ensure that you have the following installed on your machine:

  • Python 3.x
  • Git

Steps to Setup

  1. Clone the Repository

    git clone https://github.com/Darshanroy/HumanAIze-FinTech-Hackathon.git
  2. Navigate to the Project Directory

    cd HumanAIze-FinTech-Hackathon
  3. Install the Required Dependencies

    pip install -r requirements.txt
  4. Run the Streamlit App

    streamlit run streamlit-app.py

Usage

Initializing Document Embedding

  1. Initialize Document Embedding:
    • Click the "Initialize Document Embedding" button. This process will load the documents and create the necessary embeddings. A message will be displayed indicating that the vector store database is ready.

Asking Questions

  1. Enter Your Question:
    • Input your question in the text box labeled "Enter your question based on the documents".
    • Click the "Submit Question" button to submit your question.

Viewing the Response

  1. Response Time:
    • The response time for processing the question will be displayed.
  2. Answer:
    • The answer generated by the model will be shown below the response time.
  3. Document Similarity Search:
    • Expand the "Document Similarity Search" section to view the documents that were most relevant to your question. This section will display the content of these documents.

Technical Approach and Implementation Details

Imports

Necessary libraries for Streamlit, environment variables, and Langchain components are imported.

Environment Variables and Streamlit Title

Environment variables (like API tokens) are loaded using dotenv. The Streamlit app's title is set to "AI-powered GenAI Chatbot for Legal Assistance using RAG and HYDE techniques".

Initializing Embeddings and Models

  • HuggingFaceEndpointEmbeddings are loaded for sentence embedding using the sentence-transformers/all-MiniLM-L6-v2 model.
  • Two model components are then created (see the sketch below):
    • mistral_llm: a HuggingFaceEndpoint for text generation with the mistralai/Mistral-7B-Instruct-v0.3 model.
    • mistral_hyde_embeddings: a HyDE embedder that generates hypothetical document embeddings by combining mistral_llm with the loaded sentence embeddings and the web_search prompt.
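
A sketch of how these components can be wired together with LangChain's Hugging Face integrations and HyDE support; the generation parameters are illustrative, and import paths may differ slightly depending on the LangChain version.

    # Illustrative initialization of the embedding model, the generator LLM, and the HyDE embedder.
    from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEndpointEmbeddings
    from langchain.chains import HypotheticalDocumentEmbedder

    hf_embeddings = HuggingFaceEndpointEmbeddings(
        model="sentence-transformers/all-MiniLM-L6-v2"
    )

    mistral_llm = HuggingFaceEndpoint(
        repo_id="mistralai/Mistral-7B-Instruct-v0.3",
        temperature=0.1,        # illustrative generation settings
        max_new_tokens=512,
    )

    # The HyDE embedder lets mistral_llm write a hypothetical document, which is
    # then embedded with hf_embeddings using the built-in "web_search" prompt.
    mistral_hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
        mistral_llm, hf_embeddings, "web_search"
    )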

Chat Prompt Template

qa_prompt_template defines the question-answering prompt supplied to mistral_llm; it includes the retrieved context and the user's question. An illustrative version is shown below.
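
The exact wording in the app may differ; this sketch assumes the standard context/input placeholders used by LangChain's document chains.

    # Question-answering prompt with placeholders for the retrieved context and the question.
    from langchain_core.prompts import ChatPromptTemplate

    qa_prompt_template = ChatPromptTemplate.from_messages([
        ("system",
         "You are a legal assistant. Answer the question using only the provided "
         "context from legal documents and case records.\n\n{context}"),
        ("human", "{input}"),
    ])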

prepare_vector_store Function

This function initializes the vector store database:

  • Stores the mistral_hyde_embeddings in the session state.
  • Loads documents using UnstructuredCSVLoader from docs/Lex/podcastdata_dataset.csv.
  • Splits documents using RecursiveCharacterTextSplitter for efficient processing.
  • Creates a Chroma vector store from the first 10,000 split documents and the mistral_hyde_embeddings (a condensed sketch follows).
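
A condensed sketch of prepare_vector_store, reusing mistral_hyde_embeddings from the previous sketch; the chunk sizes are assumptions.

    # Build the Chroma vector store once per session and cache it in Streamlit's session state.
    import streamlit as st
    from langchain_community.document_loaders import UnstructuredCSVLoader
    from langchain_community.vectorstores import Chroma
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    def prepare_vector_store():
        if "vector_store" in st.session_state:
            return  # already initialized for this session

        st.session_state.embeddings = mistral_hyde_embeddings

        # Load and split the source documents
        documents = UnstructuredCSVLoader("docs/Lex/podcastdata_dataset.csv").load()
        splits = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=200
        ).split_documents(documents)

        # Embed only the first 10,000 chunks to keep indexing time manageable
        st.session_state.vector_store = Chroma.from_documents(
            splits[:10000], st.session_state.embeddings
        )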

User Input and Contextualization

  • user_question is a text input field for the user to enter their question.
  • contextualize_q_prompt instructs the model to rewrite the user's question, in light of the chat history, into a standalone question that can be understood without that history (an illustrative version follows).
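
The instruction text below is an assumption, but the structure (system instruction, chat-history placeholder, latest question) follows the standard LangChain pattern.

    # Rewrite the latest question into a standalone question, given the chat history.
    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

    contextualize_q_prompt = ChatPromptTemplate.from_messages([
        ("system",
         "Given the chat history and the latest user question, reformulate the "
         "question so it can be understood without the chat history. Do not "
         "answer the question; only rewrite it if needed."),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])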

Chat History Management

get_chat_session_history retrieves the chat history for the current session (identified by session_id).
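
A minimal sketch of this helper, assuming an in-memory dictionary keyed by session_id:

    # In-memory store of per-session chat histories.
    from langchain_community.chat_message_histories import ChatMessageHistory
    from langchain_core.chat_history import BaseChatMessageHistory

    chat_store = {}

    def get_chat_session_history(session_id: str) -> BaseChatMessageHistory:
        # Create a fresh history the first time a session is seen, then reuse it
        if session_id not in chat_store:
            chat_store[session_id] = ChatMessageHistory()
        return chat_store[session_id]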

Main Loop

The loop continues until the user enters 'q' to quit. Inside the loop (a condensed sketch of this flow follows the list):

  • If the user enters a question (anything other than 'q'):
    • A question-answer chain (question_answer_chain) is created using create_stuff_documents_chain with the mistral_llm and the qa_prompt_template.
    • A history-aware retriever (history_aware_retriever) is created using create_history_aware_retriever. This retriever uses mistral_llm and contextualize_q_prompt to rewrite the user's question in light of the chat history before querying the document embeddings.
    • A retrieval chain (retrieval_chain) is created using create_retrieval_chain to combine the history-aware retriever and the question-answer chain.
    • A conversational RAG chain (conversational_rag_chain) is created using RunnableWithMessageHistory. This chain manages the chat history and uses the retrieval chain to answer questions based on the documents.
    • Response time is measured by recording time.process_time() before invoking the conversational_rag_chain with the user's question and a session ID, and computing the elapsed time when the call returns.
    • The retrieved answer is displayed along with the response time.
    • An expander section allows viewing the document similarity search results (top retrieved documents based on the user's question).
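
This sketch shows how the pieces above can fit together, reusing the objects from the earlier sketches (mistral_llm, qa_prompt_template, contextualize_q_prompt, get_chat_session_history, the vector store, and the user_question input); the session ID and message keys are assumptions based on the standard LangChain conversational RAG pattern.

    # Assemble the conversational RAG chain and answer one question (illustrative).
    import time
    import streamlit as st
    from langchain.chains import create_history_aware_retriever, create_retrieval_chain
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain_core.runnables.history import RunnableWithMessageHistory

    retriever = st.session_state.vector_store.as_retriever()

    question_answer_chain = create_stuff_documents_chain(mistral_llm, qa_prompt_template)
    history_aware_retriever = create_history_aware_retriever(
        mistral_llm, retriever, contextualize_q_prompt
    )
    retrieval_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

    conversational_rag_chain = RunnableWithMessageHistory(
        retrieval_chain,
        get_chat_session_history,
        input_messages_key="input",
        history_messages_key="chat_history",
        output_messages_key="answer",
    )

    start = time.process_time()
    response = conversational_rag_chain.invoke(
        {"input": user_question},
        config={"configurable": {"session_id": "default"}},
    )
    elapsed = time.process_time() - start  # displayed as the response time

    st.write(response["answer"])
    with st.expander("Document Similarity Search"):
        for doc in response["context"]:  # top retrieved documents
            st.write(doc.page_content)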

Data Sources and Preprocessing Steps

Data Source

  • Authoritative Indian law books and real-life case data

Preprocessing Steps

  • Calculated a basic contextual length for each record
  • Removed unusual symbols and outlier records (a rough sketch of these steps follows)
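
A rough sketch of what these steps could look like; the file name, the text column name, and the outlier rule are all assumptions, not the project's actual preprocessing code.

    # Rough preprocessing sketch: compute text lengths, strip unusual symbols, drop outliers.
    import re
    import pandas as pd

    df = pd.read_csv("legal_corpus.csv")  # hypothetical input file

    # Basic contextual length: number of characters per record
    df["context_length"] = df["text"].str.len()

    # Remove unusual symbols, keeping ordinary punctuation
    df["text"] = df["text"].apply(lambda t: re.sub(r"[^\w\s.,;:()'\"-]", " ", str(t)))

    # Drop length outliers (here: outside the 1st-99th percentile range)
    low, high = df["context_length"].quantile([0.01, 0.99])
    df = df[df["context_length"].between(low, high)]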

Challenges Faced and How They Were Addressed

1. LLM & Embedding Model Selection

  • Initial Exploration: OllamaIndex (LLM), Hugging Face (LLM), Langchain (other operations)
  • Challenges: Loading models from different sources, slow Ollama embeddings, Google Generative AI embeddings struggling with large document volumes
  • Solution: Hugging Face Langchain library, Sentence Transformers for embedding

2. Data Loading

  • Used Unstructured CSV loader for efficiency

3. Vector Storage Comparison

  • FAISS vs. Chroma
  • Selection: FAISS for dense vector similarity search

4. Retriever Selection

  • Explored Parent Document Retriever and Ensemble Retriever
  • Selection: Standard Retriever for practicality

5. Evaluation

  • Used unlabeled evaluation metrics within Langchain
  • Pairwise string and embedding evaluations to identify the best model pair (see the sketch below)
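
A sketch of how these LangChain evaluators can be applied, assuming the mistral_llm and hf_embeddings objects from the earlier sketches; the candidate answers shown are placeholders.

    # Pairwise comparison of two candidate answers with LangChain's unlabeled evaluators.
    from langchain.evaluation import load_evaluator

    string_evaluator = load_evaluator("pairwise_string", llm=mistral_llm)
    embedding_evaluator = load_evaluator("pairwise_embedding_distance", embeddings=hf_embeddings)

    # LLM-judged preference between the two answers
    preference = string_evaluator.evaluate_string_pairs(
        prediction="Answer produced by model/embedding pair A",
        prediction_b="Answer produced by model/embedding pair B",
        input="What is the punishment for theft under the IPC?",
    )

    # Embedding distance between the two answers (smaller means more similar)
    distance = embedding_evaluator.evaluate_string_pairs(
        prediction="Answer produced by model/embedding pair A",
        prediction_b="Answer produced by model/embedding pair B",
    )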

Notes

  • If you want to stop the application, you can do so by closing the terminal or command prompt window where the Streamlit app is running.
  • To ask a new question, simply enter it in the text box and click the "Submit Question" button again.