📄 PDF Query RAG

A hands-on RAG (Retrieval-Augmented Generation) application that transforms PDF documents into queryable knowledge bases


Perfect for learning RAG, LangChain, and LLM-powered document interactions


🎯 What is This Project?

This is a complete RAG application that demonstrates how to build an intelligent system that:

  • Uploads and processes PDF documents
  • Extracts and chunks text intelligently
  • Creates semantic embeddings for document search
  • Answers questions using retrieved context from PDFs
  • Returns human-friendly answers based on document content

Perfect for students learning:

  • 🤖 Retrieval-Augmented Generation (RAG)
  • 🔗 LangChain framework
  • 💬 LLM prompt engineering
  • 📄 PDF processing and text extraction
  • 🔍 Vector embeddings and similarity search
  • 🌐 Building full-stack AI applications

✨ Key Features

| Feature | Description |
| --- | --- |
| 📄 PDF Processing | Automatic text extraction and intelligent chunking |
| 🧠 Semantic Search | Powered by OpenAI embeddings and FAISS vector store |
| 💬 Context-Aware Answers | Uses GPT models with retrieved context from documents |
| 🎨 Dual Interface | Both a Streamlit web UI and a FastAPI REST API |
| 🔄 RAG Pipeline | Complete RAG implementation with LangChain |
| ⚡ Real-time Processing | Upload and query documents instantly |
| 🚀 Solid Foundations | Modular architecture, error handling, and best practices |

🏗️ Architecture Overview

┌─────────────────┐
│   PDF Upload    │  User uploads a PDF document
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────┐
│      PDF Processing Pipeline         │
│  ┌───────────────────────────────┐  │
│  │  1. Extract text from PDF     │  │
│  └──────────────┬────────────────┘  │
│                 │                    │
│  ┌──────────────▼────────────────┐  │
│  │  2. Split into chunks         │  │
│  └──────────────┬────────────────┘  │
│                 │                    │
│  ┌──────────────▼────────────────┐  │
│  │  3. Create embeddings         │  │
│  │     and build vector store    │  │
│  └──────────────┬────────────────┘  │
└─────────────────┼────────────────────┘
                  │
                  ▼
         ┌────────────────┐
         │  User Query    │  "What is the main topic?"
         └────────┬───────┘
                  │
                  ▼
┌─────────────────────────────────────┐
│      LangChain RAG Pipeline         │
│  ┌───────────────────────────────┐  │
│  │  1. Embed query and search    │  │
│  │     for relevant chunks       │  │
│  └──────────────┬────────────────┘  │
│                 │                    │
│  ┌──────────────▼────────────────┐  │
│  │  2. Retrieve top-k chunks     │  │
│  └──────────────┬────────────────┘  │
│                 │                    │
│  ┌──────────────▼────────────────┐  │
│  │  3. LLM generates answer      │  │
│  │     using retrieved context   │  │
│  └──────────────┬────────────────┘  │
└─────────────────┼────────────────────┘
                  │
                  ▼
         ┌────────────────┐
         │  User-friendly │
         │     Answer     │
         └────────────────┘

🛠️ Tech Stack

| Category | Technology | Purpose |
| --- | --- | --- |
| 🤖 AI/ML | LangChain | RAG pipeline orchestration |
| | OpenAI GPT-4o-mini | LLM for answer generation |
| | Sentence Transformers | Local embeddings (all-MiniLM-L6-v2) |
| 🌐 Backend | FastAPI | REST API server |
| 💻 Frontend | Streamlit | Interactive web interface |
| 🗄️ Vector Store | FAISS | Efficient similarity search |
| 📄 PDF Processing | PyPDF | PDF text extraction |
| ⚙️ Tools | uv | Fast Python package manager |
| | Python 3.10+ | Programming language |

📦 Project Structure

rag-pdf-python/
├── backend/
│   ├── __init__.py
│   └── main.py              # 🚀 FastAPI REST endpoints
│
├── frontend/
│   ├── __init__.py
│   └── app.py               # 🎨 Streamlit web interface
│
├── shared/
│   ├── __init__.py
│   ├── config.py            # ⚙️ Configuration & environment variables
│   ├── pdf_processor.py     # 📄 PDF extraction & chunking
│   ├── vector_store.py      # 🔍 FAISS vector store management
│   └── rag.py               # 🧠 RAG query engine
│
├── uploads/                 # 📁 Uploaded PDFs directory (auto-created)
├── pyproject.toml           # 📋 Dependencies & project config
├── uv.lock                  # 🔒 Dependency lock file
└── README.md

🚀 Quick Start

Prerequisites

  • Python 3.10+ installed
  • OpenAI API Key (create one at platform.openai.com)
  • uv package manager (we'll install it if needed)

Installation Steps

1️⃣ Install uv (if needed)

curl -LsSf https://astral.sh/uv/install.sh | sh

2️⃣ Clone and Navigate

git clone https://github.com/JaimeLucena/rag-pdf-python.git
cd rag-pdf-python

3️⃣ Install Dependencies

uv sync

This will create a virtual environment and install all required packages.

4️⃣ Configure Environment

Create a .env file in the root directory:

OPENAI_API_KEY=sk-your-api-key-here
OPENAI_MODEL=gpt-4-turbo-preview
EMBEDDING_MODEL=text-embedding-3-small
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
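`shared/config.py` reads these values at startup. A minimal, stdlib-only sketch of that pattern (variable names and defaults taken from the example above; the actual loader may differ):

```python
import os


def load_config() -> dict:
    """Read settings from the environment (populated from .env),
    falling back to the defaults documented in the Configuration section."""
    return {
        "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),  # required in practice
        "openai_model": os.environ.get("OPENAI_MODEL", "gpt-4-turbo-preview"),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        "chunk_size": int(os.environ.get("CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.environ.get("CHUNK_OVERLAP", "200")),
    }
```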

💡 Tip: Never commit your .env file! It's already in .gitignore


🎮 Usage

Step 1: Start the Backend

Launch the FastAPI backend server:

uv run uvicorn backend.main:app --host localhost --port 8000 --reload

The API will be available at http://localhost:8000

Step 2: Start the Frontend

In a new terminal, launch the Streamlit web interface:

uv run streamlit run frontend/app.py

The web UI will automatically open in your browser at http://localhost:8501

Features:

  • 💬 Chat interface for natural language queries
  • 📄 Drag & drop PDF upload
  • 🎨 Clean, modern UI
  • 📝 Chat history
  • ✅ Real-time processing status

Step 3 (Optional): Use the REST API Directly

You can also interact with the API directly using HTTP requests:

Upload a PDF

curl -X POST "http://localhost:8000/upload" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"

Response:

{
  "message": "PDF uploaded and processed successfully",
  "chunks": 42,
  "filename": "document.pdf"
}

Query the Document

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic of this document?"}'

Response:

{
  "answer": "The main topic of this document is..."
}

Health Check

curl http://localhost:8000/health

Response:

{
  "status": "healthy"
}
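The same endpoints can be called from Python. A small client sketch, assuming the third-party `requests` package and the backend from Step 1 running on port 8000:

```python
import requests  # third-party; add it with `uv add requests` if needed

API_BASE = "http://localhost:8000"


def upload_pdf(path: str) -> dict:
    """POST a PDF as the multipart form field 'file' (mirrors the curl example)."""
    with open(path, "rb") as f:
        resp = requests.post(f"{API_BASE}/upload", files={"file": f})
    resp.raise_for_status()
    return resp.json()


def ask(question: str) -> str:
    """POST a JSON question to /query and return the answer string."""
    resp = requests.post(f"{API_BASE}/query", json={"question": question})
    resp.raise_for_status()
    return resp.json()["answer"]
```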

💡 Example Questions

Try asking these questions after uploading a PDF:

Basic Queries

  • "What is this document about?"
  • "Summarize the main points"
  • "What are the key findings?"
  • "List the main topics"

Specific Information

  • "What is mentioned about [topic]?"
  • "Who are the authors?"
  • "What are the conclusions?"
  • "What methodology was used?"

Detailed Queries

  • "Explain the process described in the document"
  • "What are the recommendations?"
  • "What data or statistics are mentioned?"
  • "What are the limitations discussed?"

🧠 How RAG Works Here

Step-by-Step Process

  1. PDF Upload → User uploads a PDF document

    document.pdf uploaded
    
  2. Text Extraction → Extract all text from PDF pages

    Extracted 15,000 characters from 10 pages
    
  3. Chunking → Split text into overlapping chunks

    Created 42 chunks of ~1000 characters each
    
  4. Embedding → Create vector embeddings for each chunk

    Generated 42 embeddings using sentence transformers
    
  5. Vector Store → Build FAISS index for similarity search

    FAISS index created with 42 vectors
    
  6. Query Processing → User asks a question

    "What is the main topic?"
    
  7. Retrieval → Find most relevant chunks

    Retrieved top 5 chunks with highest similarity
    
  8. Generation → LLM generates answer using context

    "Based on the document, the main topic is..."
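Step 3 (chunking) is the easiest part to see in code. A minimal sliding-window sketch using the documented chunk size and overlap; the real `shared/pdf_processor.py` additionally respects sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of ~chunk_size characters, each sharing
    `overlap` characters with its predecessor so ideas aren't cut mid-thought."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far each chunk's start advances
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```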
    

Key Components

  • shared/pdf_processor.py: PDF text extraction and chunking

    • Extracts text from PDF pages
    • Intelligent chunking with sentence boundaries
    • Configurable chunk size and overlap
  • shared/vector_store.py: Vector embeddings and search

    • FAISS index for fast similarity search
    • Sentence transformer embeddings
    • Top-k retrieval
  • shared/rag.py: RAG query engine

    • Combines retrieval and generation
    • Context-aware prompt engineering
    • OpenAI GPT integration
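The retrieval-plus-generation flow in `shared/rag.py` can be sketched end to end. Here a toy word-count similarity stands in for the real embedding model and FAISS index, and a prompt builder stands in for the OpenAI call; only the shape of the logic matches the project:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for the sentence-transformer model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by similarity to the query and keep the top k
    (FAISS performs this same nearest-neighbour search, only faster)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the context-stuffed prompt that would be sent to the LLM."""
    joined = "\n\n".join(context)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )
```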

📊 Configuration

You can customize the application behavior through environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| `OPENAI_API_KEY` | Your OpenAI API key | *Required* |
| `OPENAI_MODEL` | GPT model for answers | `gpt-4-turbo-preview` |
| `EMBEDDING_MODEL` | Model for embeddings | `text-embedding-3-small` |
| `CHUNK_SIZE` | Text chunk size (characters) | `1000` |
| `CHUNK_OVERLAP` | Overlap between chunks (characters) | `200` |

🎓 Learning Objectives

By exploring this project, you'll learn:

RAG Fundamentals

  • How to combine retrieval (vector search) with generation (LLM)
  • Building end-to-end RAG pipelines
  • Context-aware answer generation

LangChain Patterns

  • Creating custom chains
  • Prompt engineering
  • LLM integration

Vector Embeddings

  • Creating document embeddings
  • Similarity search with FAISS
  • Retrieval strategies

PDF Processing

  • Text extraction from PDFs
  • Intelligent text chunking
  • Handling different document formats

Full-Stack AI Apps

  • Building APIs for AI services
  • Creating interactive UIs
  • Managing state and sessions

Best Practices

  • Modular code organization
  • Environment configuration
  • Error handling
  • Type hints and documentation

🔧 Development

Running Tests

uv run pytest

Code Formatting

uv run ruff format .
uv run ruff check .

🤔 Common Questions

Q: Why FAISS instead of other vector databases?

A: FAISS is perfect for learning - it's fast, in-memory, and requires no setup. For production with persistence, consider Pinecone, Weaviate, or Qdrant.

Q: Can I use a different LLM?

A: Yes! LangChain supports many providers. Just change the LLM initialization in shared/rag.py and update your API key.

Q: How do I persist the vector store?

A: The VectorStore class has save() and load() methods. You can modify the backend to persist embeddings between sessions.
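Persistence can be as simple as serialising the chunks and their embedding vectors to disk; a stdlib-only sketch of that idea (the FAISS index itself can be written with `faiss.write_index` and restored with `faiss.read_index`):

```python
import pickle
from pathlib import Path


def save_store(path: str, chunks: list[str], embeddings: list[list[float]]) -> None:
    """Persist chunks and their embedding vectors so they survive a restart."""
    Path(path).write_bytes(pickle.dumps({"chunks": chunks, "embeddings": embeddings}))


def load_store(path: str) -> dict:
    """Load a previously saved store; rebuild the FAISS index from 'embeddings'."""
    return pickle.loads(Path(path).read_bytes())
```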

Q: Is this production-ready?

A: This is a learning project. For production, add authentication, rate limiting, logging, monitoring, and persistent vector storage.

Q: What PDF formats are supported?

A: Currently supports standard PDFs with extractable text. Scanned PDFs (images) would require OCR preprocessing.




📝 License

MIT License - see LICENSE file for details


🙏 Acknowledgments

Built with ❤️ for students learning AI and generative models.

Happy Learning! 🚀


Made with ❤️ for the AI learning community

Star this repo if you found it helpful!