This repository contains code to build a question-answering system around the laws and tactics of croquet using Retrieval-Augmented Generation (RAG). The system processes PDF documents, indexes them for retrieval, and answers questions based on their content. It combines a Mistral AI model for language generation, OpenAI embeddings with Chroma for vector storage, and LangChain for document splitting and retrieval.
Ensure you have the following Python packages installed:
```bash
pip install -qU pypdf langchain_community langchain_chroma langchain_openai langchain-mistralai
```
Set up your `.env` file to manage your API keys. The following environment variables are required:

- `OPENAI_API_KEY`: Your OpenAI API key.
- `MISTRAL_API_KEY`: Your Mistral AI API key.

You can create the `.env` file with the following content:

```
OPENAI_API_KEY=your_openai_api_key_here
MISTRAL_API_KEY=your_mistral_api_key_here
```
Load the environment variables with the following code in your script:

```python
from dotenv import load_dotenv
import os

# Reads the .env file and exposes the keys through os.environ.
load_dotenv()
```
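As an optional sanity check (not part of the original script), you can confirm that both keys are visible to the process; this assumes `load_dotenv()` has already run as above:

```python
# Fail fast if either key is missing from the environment.
for key in ("OPENAI_API_KEY", "MISTRAL_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"Missing environment variable: {key}")
```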
- Place all your croquet-related PDF files in a directory called `croquet_pdfs`.
- Run the script to process the PDFs, generate embeddings, and set up the RAG pipeline for question answering (a sketch of this pipeline follows below).
- You can invoke the question-answering system by passing a query like this:
```python
results = rag_chain.invoke({"input": "Explain what a B Baulk Tice opening is and why it is used."})
print(results["answer"])                    # the generated answer (key name assumes a create_retrieval_chain-style output)
print(results["context"][0].page_content)   # the top retrieved chunk
print(results["context"][0].metadata)       # its source PDF and page
```
The system will retrieve relevant information from the PDF documents and generate a concise answer.
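For reference, the sketch below shows one way the pipeline could be assembled. It is a minimal illustration, not the exact implementation: the LangChain helpers (`create_retrieval_chain`, `create_stuff_documents_chain`), chunking parameters, prompt wording, and model name are all assumptions.

```python
import glob

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_mistralai import ChatMistralAI
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every PDF in the croquet_pdfs directory.
docs = []
for path in glob.glob("croquet_pdfs/*.pdf"):
    docs.extend(PyPDFLoader(path).load())

# Split the documents into overlapping chunks for retrieval
# (chunk sizes here are illustrative defaults).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(docs)

# Embed the chunks with OpenAI embeddings and store them in Chroma.
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Prompt that stuffs the retrieved chunks into the model's context window.
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the following context:\n\n{context}"),
    ("human", "{input}"),
])

# Chain retrieval together with Mistral's chat model.
llm = ChatMistralAI(model="mistral-large-latest")
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
```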
- PDF Processing: Uses `PyPDFLoader` (from `langchain_community.document_loaders`) to load and process all PDFs in the `croquet_pdfs` folder.
- Document Splitting: `RecursiveCharacterTextSplitter` breaks the documents down into manageable chunks.
- Embeddings: The OpenAI embeddings model converts the text chunks into vector embeddings.
- Vector Storage: `Chroma` is the vector storage engine used to store and retrieve document embeddings.
- RAG Pipeline: Combines document retrieval with Mistral's LLM to provide relevant answers based on the retrieved context.
RAG (Retrieval-Augmented Generation) models combine traditional retrieval-based methods with generation capabilities to answer questions. The system first retrieves relevant documents, then uses a language model to generate an answer based on the retrieved information.
The system processes croquet-related PDFs, splits them into smaller chunks for efficient retrieval, and stores embeddings in a vector database. This allows for fast question-answering based on the content of the documents.
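If you want to see the retrieval step in isolation, you can query the vector store directly without invoking the LLM. This reuses the `vectorstore` object from the sketch above; the query string and `k` value are only examples:

```python
# Inspect which chunks would be retrieved for a question, without calling the LLM.
hits = vectorstore.similarity_search("What is a tice opening?", k=4)
for doc in hits:
    print(doc.metadata.get("source"), "-", doc.page_content[:120])
```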
This project is licensed under the MIT License.