The Retrieval-Augmented Generation (RAG) Project analyzes, vectorizes, and answers questions about document data. At the heart of RAG is Weaviate, a vector database that stores the document vectors and serves similarity queries over them. The project uses the LangChain framework to build the data pipeline and Streamlit to provide an interactive user interface.
- PDF Upload: Securely upload PDF documents to be processed.
- Text Extraction: Extract text from PDFs and split the content into manageable segments (chunks).
- Chunk Storage: Efficiently store extracted text chunks in Weaviate, ensuring quick retrieval and organization.
- Embeddings Retrieval: Generate and retrieve document embeddings, enabling deep semantic search and analysis.
- LLM Integration: Seamlessly integrate with Large Language Models for enhanced comprehension and generation tasks.
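The "manageable segments" in the Text Extraction step can be pictured as overlapping windows over the extracted text. The sketch below is illustrative only: the project itself relies on LangChain for splitting, and the function name `chunk_text` and the `chunk_size` and `overlap` defaults are assumptions, not values taken from the repository.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not the project's actual splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which tends to help retrieval quality.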
Ensure you have Python 3.x installed on your machine. The project depends on several Python libraries, including LangChain, Weaviate, and Streamlit, which are installed from requirements.txt.
- Clone the Repository
  Start by cloning the RAG project repository to your local machine:
  git clone <repository-url>
- Install Dependencies
  Navigate to the project directory (justRagit) and install the required Python libraries:
  cd path/to/rag-project
  pip install -r requirements.txt
Once installation is complete, you're ready to run the main application:
python src/main.py --pdf_file="path/to/your/document.pdf"
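For readers wondering how a flag like `--pdf_file` is typically consumed, here is a minimal sketch using the standard-library `argparse` module. It is an assumption about how `src/main.py` might read its argument, not a copy of the project's code; the function name `parse_args` and the choice to make the flag required are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; only --pdf_file comes from the README."""
    parser = argparse.ArgumentParser(
        description="Process a PDF for retrieval-augmented question answering."
    )
    parser.add_argument(
        "--pdf_file",
        required=True,  # assumption: the real script may supply a default instead
        type=Path,
        help="Path to the PDF document to process",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Would process: {args.pdf_file}")
```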
The application retrieves text data based on a given query. It indexes two sample files from the directory named 'pdf', retrieves the top 5 matching results for the query, and uses the gpt-3.5-turbo-0613 model to generate an answer from them.
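The top-5 retrieval step can be illustrated without any external service. In the project this ranking is delegated to Weaviate; the sketch below only shows the underlying idea, using cosine similarity over plain Python lists. The function names `cosine_similarity` and `top_k` and the toy vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), index)
        for index, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
    chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(top_k([1.0, 0.0], chunks, k=2))
```

The text of the selected chunks would then be placed in the prompt sent to the language model, so the answer is grounded in the indexed documents.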
To launch the Streamlit interface, run:
streamlit run gui.py
Video for second milestone: https://www.youtube.com/watch?v=HD_PS3HMkCk