A React-Flask application that implements Retrieval-Augmented Generation (RAG) for document-based conversations. This application allows users to upload documents (PDFs, CSVs) and chat with an AI that can reference and retrieve information from these documents.
- 💬 Interactive chat interface
- 📁 Document upload and processing
- 🔍 Retrieval-augmented responses
- Vector search
- Semantic chunking
- Summarization
- Tools
- Document tool use
- Data science tool use
- 🗂️ Session management
- ⚙️ Configurable model settings
- 🎨 Modern, responsive UI
- 📊 Comprehensive RAG evaluation system
- Automated QA pair generation
- Human curation interface
- Multi-metric evaluation (programmatic, LLM-based, and hybrid) with an option for human evaluation
- Detailed performance analytics
- Flask
- LlamaIndex
- ChromaDB
- Ollama
- Sentence Transformers
- spaCy
- React
- Vite
- TailwindCSS
- React Router
- Headless UI
- Lucide Icons
- Python 3.10+
- Node.js 16+
- npm/yarn
- Ollama with models installed (Llama, Qwen)
- spaCy English model (python -m spacy download en_core_web_sm)
💡 Resource Note: Running a full suite of ablation experiments with 2-3 different models costs roughly $4 in electricity - that's less than a coffee! Perfect for researchers and hobbyists looking to experiment with state-of-the-art RAG systems on consumer hardware.
- Clone the repository
git clone [your-repo-url]
cd [repo-name]
- Create and activate virtual environment
python -m venv rag_app_env
source rag_app_env/bin/activate # On Windows: .\rag_app_env\Scripts\activate
- Install Python dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm
- Navigate to frontend directory
cd rag-frontend
- Install dependencies
npm install
The application includes sophisticated data generation capabilities for training and evaluation:
- Cross-document Topic Analysis: Automatically identifies and correlates related topics across multiple documents
- Semantic Chunking: Intelligent text segmentation based on topic coherence
- Multi-stage Processing:
- Topic Identification
- Section Extraction
- QA Pair Generation
- Human Curation Interface
flowchart TD
subgraph Input
Docs[Documents] --> TopicAnalysis[Topic Analysis]
end
subgraph Processing
TopicAnalysis --> Topics[Topic Identification]
Topics --> Sections[Section Extraction]
Sections --> CrossDoc[Cross-document Correlation]
CrossDoc --> QAPairs[QA Pair Generation]
end
subgraph Curation
QAPairs --> HumanReview[Human Review]
HumanReview --> Approved{Approved?}
Approved -->|Yes| Final[Final QA Pairs]
Approved -->|No| Modify[Modify QA]
Modify --> HumanReview
end
- Automated Topic Discovery: Identifies key topics and their relationships
- Cross-document Aggregation: Combines related information across multiple sources
- Hierarchical Processing:
- Topic Collection
- Section Identification
- Task Generation
- Quality Assurance
flowchart TD
Doc[Documents] --> Topics[Topic Collection]
Topics --> Sections[Section Identification]
Sections --> Combine[Cross-document Aggregation]
Combine --> Generate[Task Generation]
Generate --> QA[QA Extraction]
subgraph "Quality Checks"
QA --> Confidence[Confidence Scoring]
Confidence --> KeyPoints[Key Points Extraction]
KeyPoints --> HumanReview[Human Review]
end
- Automated progress saving during processing
- Resume capability for interrupted operations
- Progress tracking and reporting
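The sketch below illustrates one simple way such checkpoint/resume behavior can be structured; the file name and JSON layout are illustrative, not the generators' actual format.

```python
import json
import os

CHECKPOINT_FILE = "qa_generation_progress.json"  # illustrative name

def load_checkpoint():
    """Return IDs of documents already processed, or an empty set on a fresh run."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return set(json.load(f)["done"])
    return set()

def save_checkpoint(done_ids):
    """Persist progress after each document so interrupted runs can resume."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"done": sorted(done_ids)}, f)

def process_documents(doc_ids, process_fn):
    done = load_checkpoint()
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # resume: skip documents already handled in a previous run
        process_fn(doc_id)
        done.add(doc_id)
        save_checkpoint(done)
```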
The repository includes a script (finetuning/unsloth.py) for finetuning LLMs on generated QA pairs:
- Uses Unsloth's optimization techniques for efficient training
- Supports LoRA adapters for parameter-efficient finetuning
- Configurable for different model sizes and architectures
- Includes chat template formatting for consistent training
- Features:
- 4-bit quantization support
- Gradient checkpointing
- Custom chat template application
- Inference utilities
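For orientation, here is a condensed sketch of the standard Unsloth LoRA workflow that finetuning/unsloth.py builds on; the model name, dataset fields, and hyperparameters below are illustrative and may not match the script.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized base model (model name is illustrative)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters for parameter-efficient finetuning
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

# QA pairs already formatted with the model's chat template (field name is assumed)
dataset = load_dataset("json", data_files="qa_pairs_curated.json", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```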
- Start the Flask backend
# In the root directory
python app.py
- Start the React frontend
# In the rag-frontend directory
npm run dev
- Access the application at http://localhost:5173
- Build the frontend
cd rag-frontend
npm run build
- Run with Docker Compose
docker-compose up --build
- Start a New Session
  - Click the "+" button in the sidebar
  - Enter a session name
- Upload Documents
  - Click "Upload Files" button
  - Select PDF or CSV files
  - Wait for processing confirmation
- Chat Interface
  - Type messages in the input field
  - View responses in the chat window
  - Switch between sessions using the sidebar
- Configure Settings
  - Click the settings icon
  - Select different models and embeddings
  - Save changes
- RAG Evaluation System
  - Generate QA pairs from documents:
    python qa_generator.py -i /path/to/docs -o qa_pairs.json
    python summarization_qa_generator.py -i /path/to/docs -o summary_qa_pairs.json  # use in conjunction with an agent to provide more accurate curation
  - Curate generated QA pairs manually in the JSON file
  - Generate agent responses:
    python response_generator.py -q qa_pairs_curated.json -d docs/
  - Run evaluation metrics:
    python metrics_eval.py -q qa_pairs_curated_with_responses.json
stateDiagram-v2
[*] --> NewSession: Create Session
NewSession --> Upload: Upload Documents
Upload --> Processing: Process Files
Processing --> Ready: Index Ready
Ready --> Chat: Start Chat
Chat --> Query: User Query
Query --> Retrieval: Get Context
Retrieval --> Generation: Generate Response
Generation --> Chat: Show Response
Chat --> Ready: New Query
Chat --> Upload: Add Documents
Ready --> NewSession: Switch Session
.
├── agent_networks
│ ├── agents_full_suite
│ │ ├── custom.py
│ │ └── react.py
│ ├── agents_search_only
│ │ ├── custom.py
│ │ └── react.py
│ ├── __init__.py
│ ├── naive_rag
│ │ ├── naive_rag.py
├── data
│ ├── chromadb
│ │ ├── chromadb
│ │ ├── default_vector_store
│ │ └── document_store
│ ├── document1.pdf
│ │ ├── default__vector_store.json
│ │ ├── docstore.json
│ │ ├── graph_store.json
│ │ ├── image__vector_store.json
│ │ └── index_store.json
│ └── storage
├── dataset_generation
│ ├── json-transform.py
│ ├── qa_generator.py
│ └── summarization_qa_generator.py
├── docker-compose.yaml
├── Dockerfile
├── finetuning
│ └── unsloth.py
├── logging_config.py
├── logs
│ ├── app.log
│ └── summary_qa_extraction.log
├── main.py
├── metrics_eval.py
├── queue.sh
├── rag-frontend
├── README.md
├── reports
│ ├── comparative_plot.py
├── requirements.txt
├── response_generator.py
├── sessions
├── utils.py
graph TB
subgraph Frontend
UI[Chat Interface]
Settings[Settings Panel]
Upload[File Upload]
end
subgraph Backend
API[Flask API]
DB[(ChromaDB)]
RAG[RAG Engine]
Tools[Tool Suite]
end
subgraph Models
LLM[LLM/Ollama]
Embed[Sentence Transformers]
NLP[spaCy]
end
UI --> API
Settings --> API
Upload --> API
API --> RAG
RAG --> DB
RAG --> Tools
RAG --> LLM
RAG --> Embed
RAG --> NLP
DB --> RAG
flowchart TD
Upload[Document Upload] --> Process[Process Document]
Process --> Parse[Parse Content]
Parse --> Chunk[Chunk Text]
Chunk --> Embed[Generate Embeddings]
Embed --> Store[Store in ChromaDB]
subgraph Processing Options
Parse -->|PDF| ExtractPDF[Extract Text]
Parse --> |CSV| ParseCSV[Parse Data]
Parse --> |TXT| ReadText[Read Content]
end
subgraph Chunking Strategy
Chunk --> Semantic[Semantic Chunks]
Chunk --> Fixed[Fixed Size]
Chunk --> Overlap[With Overlap]
end
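The chunking options in the diagram map naturally onto LlamaIndex node parsers; a minimal sketch (parameter values and the embedding model are illustrative, not the application's actual settings):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

documents = SimpleDirectoryReader("docs/").load_data()

# Fixed-size chunks with overlap
fixed_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
fixed_nodes = fixed_parser.get_nodes_from_documents(documents)

# Semantic chunks: split where embedding similarity between adjacent sentences drops
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
semantic_parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
semantic_nodes = semantic_parser.get_nodes_from_documents(documents)
```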
The system implements a modular retrieval architecture with three distinct strategies:
- Naive Retrieval (RetrieverType.NAIVE)
  - Pure vector similarity search
  - Uses VectorStoreIndex with configurable similarity_top_k
  - Best for simple queries where semantic similarity is sufficient
- Naive with Reranking (RetrieverType.NAIVE_RERANKER)
  - Vector similarity search followed by cross-encoder reranking
  - Uses SentenceTransformerRerank with "cross-encoder/ms-marco-MiniLM-L-2-v2"
  - Improves result relevance by reranking initial candidates
  - Default retrieval strategy
- Hybrid Search (RetrieverType.HYBRID)
  - Combines vector similarity and keyword-based (BM25) search
  - Supports "AND" mode (intersection of results) and "OR" mode (union of results)
  - Best for queries that benefit from both semantic and lexical matching
  - Includes reranking step for final result refinement
Usage:
from retrievers import RetrieverType
# Initialize with specific retriever type
agent = AgentNetwork(retriever_type=RetrieverType.HYBRID)
# Use default (naive with reranking)
agent = AgentNetwork()
# Configure retriever parameters
query_engine = get_retriever(
RetrieverType.HYBRID,
nodes,
similarity_top_k=15, # Number of initial candidates
rerank_top_n=3, # Number of results after reranking
mode="OR" # Hybrid search mode ("AND" or "OR")
)
The retrieval system is designed to be modular and easily extensible, allowing for:
- Easy switching between retrieval strategies
- Consistent interface across different implementations
- Simple addition of new retrieval methods
- Fine-tuning of parameters for specific use cases
Each retriever implementation maintains the same core workflow while optimizing for different retrieval scenarios, making it easy to experiment with different approaches for various types of queries and document collections.
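As an illustration of the AND/OR behavior described above, a hybrid retriever could be sketched roughly as follows; this is not the project's get_retriever implementation, and it omits the final reranking step:

```python
from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever

def hybrid_retrieve(nodes, query, similarity_top_k=15, mode="OR"):
    """Combine vector-similarity and BM25 keyword results by node ID."""
    # Assumes an embedding model is configured via llama_index Settings
    vector_retriever = VectorStoreIndex(nodes).as_retriever(
        similarity_top_k=similarity_top_k
    )
    bm25_retriever = BM25Retriever.from_defaults(
        nodes=nodes, similarity_top_k=similarity_top_k
    )

    vector_hits = {n.node.node_id: n for n in vector_retriever.retrieve(query)}
    bm25_hits = {n.node.node_id: n for n in bm25_retriever.retrieve(query)}

    if mode == "AND":      # intersection of the two result sets
        ids = vector_hits.keys() & bm25_hits.keys()
    else:                  # "OR": union of the two result sets
        ids = vector_hits.keys() | bm25_hits.keys()

    return [vector_hits.get(i, bm25_hits.get(i)) for i in ids]
```

A production version would typically pass the combined candidates through the SentenceTransformerRerank step mentioned above before returning them.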
The system employs a comprehensive evaluation framework combining programmatic, LLM-based, and hybrid metrics to assess response quality.
- Identifies important non-stop words from context
- Measures overlap of key terms between response and ground truth
- Considers domain-specific vocabulary and technical terms
- Useful for assessing technical accuracy and domain knowledge
- Measures what percentage of ground truth tokens are captured
- Focuses on information completeness
- Higher scores indicate comprehensive coverage
- Helps identify incomplete or partial answers
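A simplified sketch of these two programmatic scores using spaCy; for brevity it draws key terms from the ground truth, whereas the actual metrics_eval.py may derive them from the context and weight terms differently:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def key_terms(text):
    """Important non-stop-word lemmas (nouns, proper nouns, adjectives, verbs)."""
    doc = nlp(text.lower())
    return {t.lemma_ for t in doc
            if not t.is_stop and not t.is_punct
            and t.pos_ in {"NOUN", "PROPN", "ADJ", "VERB"}}

def key_term_overlap(response, ground_truth):
    """Fraction of ground-truth key terms that also appear in the response."""
    truth_terms = key_terms(ground_truth)
    if not truth_terms:
        return 0.0
    return len(key_terms(response) & truth_terms) / len(truth_terms)

def token_recall(response, ground_truth):
    """Fraction of ground-truth tokens captured by the response."""
    truth_tokens = {t.text for t in nlp(ground_truth.lower()) if not t.is_punct}
    resp_tokens = {t.text for t in nlp(response.lower()) if not t.is_punct}
    return len(truth_tokens & resp_tokens) / len(truth_tokens) if truth_tokens else 0.0
```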
- LLM evaluates factual accuracy of response
- Considers contradiction and consistency with ground truth
- Scores from 0 (completely false) to 1 (completely truthful)
- Critical for ensuring factual correctness
- Assesses whether all key points from ground truth are covered
- Identifies missing important information
- Considers both explicit and implicit information
- Important for comprehensive answers
- Evaluates quality of retrieved context chunks
- Emphasizes presence of crucial information
- Weights heavily toward best source (80/20 split if highly relevant source found)
- Success criteria: at least one highly relevant source, even if others are less relevant
- Evaluates if response stays true to provided context
- Identifies any statements contradicting or unsupported by context
- Focuses purely on factual consistency, not completeness
- Important for preventing hallucination or incorrect inferences
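Each LLM-based metric reduces to prompting a judge model and parsing a score. A bare-bones sketch against a local Ollama model (the judge prompt, model name, and parsing here are illustrative, not the prompts used by metrics_eval.py):

```python
import ollama

TRUTHFULNESS_PROMPT = """Rate the factual accuracy of the response against the ground truth.
Ground truth: {truth}
Response: {response}
Answer with a single number between 0 (completely false) and 1 (completely truthful)."""

def llm_truthfulness(response, truth, model="qwen2.5"):
    """Ask a local judge model for a 0-1 truthfulness score."""
    reply = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": TRUTHFULNESS_PROMPT.format(truth=truth, response=response),
        }],
    )
    try:
        return float(reply["message"]["content"].strip())
    except ValueError:
        return 0.0  # fall back if the judge does not return a bare number
```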
- Combines semantic similarity with precision/recall measurement
- Uses embedding-based matching of key points between response and ground truth
- Matches points using similarity threshold (0.8) to allow for paraphrasing
- Calculates F1 score based on matched points
- More flexible than exact matching but still maintains accuracy
- Lower threshold (0.6) accounts for valid alternative phrasings
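A sketch of the embedding-matched F1 computation with sentence-transformers; the 0.8 threshold comes from the description above, while the model choice is illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_f1(response_points, truth_points, threshold=0.8):
    """F1 over key points, where a point 'matches' if cosine similarity >= threshold."""
    if not response_points or not truth_points:
        return 0.0
    resp_emb = model.encode(response_points, convert_to_tensor=True)
    truth_emb = model.encode(truth_points, convert_to_tensor=True)
    sims = util.cos_sim(resp_emb, truth_emb)  # (n_response, n_truth) similarity matrix

    precision = sum(float(sims[i].max()) >= threshold
                    for i in range(len(response_points))) / len(response_points)
    recall = sum(float(sims[:, j].max()) >= threshold
                 for j in range(len(truth_points))) / len(truth_points)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```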
- Novel metric measuring potential improvements over ground truth
- Evaluates response completeness relative to full context, not just ground truth
- Process:
- Extracts ALL relevant points from context given the question
- Measures coverage of these points in both ground truth and response
- Calculates relative gain in coverage
- Normalizes score where:
- 0.5 means equal coverage to ground truth
- >0.5 means better coverage than ground truth
- <0.5 means worse coverage than ground truth
- Threshold of 0.501 chosen to identify any cases where agent provided better coverage
- Includes semantic verification of extra points claimed by response
- Particularly valuable for:
- Identifying areas where curated answers could be improved
- Measuring agent's ability to provide more comprehensive answers
- Understanding context utilization effectiveness
- Example case:
Question: "What industries use AI?" Context: [Details about AI in healthcare, finance, retail, manufacturing, education] Ground Truth: "AI is used in healthcare, finance, and retail." Response: "AI is used in healthcare, finance, retail, manufacturing, and education." Result: Completeness gain > 0.5 as response covers more relevant industries from context
- Combines embedding similarity with LLM judgment
- Embedding similarity: Uses sentence-transformers model
- LLM assessment: Evaluates semantic meaning and relevance
- Final score: Weighted average (50/50) of embedding and LLM scores
- Effective for capturing meaning beyond lexical similarity
For all metrics except Semantic F1 and Completeness Gain:
- Excellent: > 0.8 (80%)
- Good: 0.6-0.8 (60-80%)
- Fair: 0.4-0.6 (40-60%)
- Poor: < 0.4 (40%)
For Semantic F1:
- Excellent: > 0.7 (70%)
- Good: 0.5-0.7 (50-70%)
- Fair: 0.2-0.5 (20-50%)
- Poor: < 0.2 (20%)
For Completeness Gain:
- Anything above 0.5 (50%) is considered a gain
Note that thresholds can be adjusted based on specific use cases and requirements. Critical applications may require higher thresholds.
The final evaluation combines metric scores weighted by importance:
- Programmatic metrics: 25% of total score
- LLM-based metrics: 45% of total score
- Hybrid metrics: 30% of total score
The aggregate pass rate requires 6/8 passed metrics (excluding completeness gain, as it is a gain metric); this can be adjusted via OVERALL_PASS_THRESHOLD. In addition, the numerical accuracy hybrid metric is reported independently; it describes how accurate the agent is when asked to retrieve a specific number.
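As a small numerical illustration of this weighting and pass rule (function and variable names are illustrative; OVERALL_PASS_THRESHOLD is treated here as a count of passed metrics, which may differ from its actual definition):

```python
def overall_score(programmatic_avg, llm_based_avg, hybrid_avg):
    """Weighted combination of the three metric-group averages (each in [0, 1])."""
    return 0.25 * programmatic_avg + 0.45 * llm_based_avg + 0.30 * hybrid_avg

OVERALL_PASS_THRESHOLD = 6  # passed metrics required out of the 8 counted ones

def aggregate_pass(num_passed_metrics):
    """Aggregate pass/fail over the 8 counted metrics."""
    return num_passed_metrics >= OVERALL_PASS_THRESHOLD
```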
This balanced approach ensures consideration of:
- Objective lexical measures
- Semantic understanding
- Context utilization
- Potential improvements over ground truth
flowchart TD
Query[User Query] --> Analyze[Analyze Query]
Analyze --> |Need Tools| ToolSelect[Select Tools]
Analyze --> |Direct Answer| Response[Generate Response]
ToolSelect --> Doc[Document Tool]
ToolSelect --> Data[Data Science Tool]
ToolSelect --> Search[Search Tool]
Doc --> Process[Process Results]
Data --> Process
Search --> Process
Process --> Response
Response --> Format[Format Output]
Format --> User[Show to User]
flowchart LR
Doc[Documents] --> Gen[QA Generator]
Gen --> Cur[Human Curation]
Cur --> Agent[Agent Response]
Agent --> Eval[Evaluation]
subgraph Evaluation Metrics
Eval --> P[Programmatic]
Eval --> L[LLM-based]
Eval --> H[Hybrid]
P --> KT[Key Terms]
P --> TR[Token Recall]
L --> T[Truthfulness]
L --> C[Completeness]
L --> S[Source Relevance]
L --> CF[Context Faithfulness]
H --> SF[Semantic F1]
H --> CG[Completeness Gain]
H --> AR[Answer Relevance]
end
- GET /chat - Retrieve chat sessions and history
- POST /chat - Send messages and process responses
- GET /settings - Get current settings
- POST /settings - Update settings
- GET /new_session - Create new chat session
- POST /new_session - Create new chat session
- POST /delete_session/<session_id> - Delete a session
- POST /switch_session/<session_id> - Switch between sessions
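A quick way to exercise these endpoints from Python; the port and JSON payload fields are assumptions, so check the Flask routes for the exact schema:

```python
import requests

BASE_URL = "http://localhost:5000"  # default Flask port; adjust if configured differently

# Create a new session (field names are illustrative)
session = requests.post(f"{BASE_URL}/new_session", json={"name": "demo"}).json()

# Send a chat message and print the agent's reply
reply = requests.post(
    f"{BASE_URL}/chat",
    json={"session_id": session.get("session_id"), "message": "Summarize document1.pdf"},
).json()
print(reply)

# Retrieve chat sessions and history
history = requests.get(f"{BASE_URL}/chat").json()
```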
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
## Citation
If you use RAG Playground in your research, please cite this paper:
@misc{papadimitriou2024ragplaygroundframeworksystematic,
title={RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems},
author={Ioannis Papadimitriou and Ilias Gialampoukidis and Stefanos Vrochidis and Ioannis Kompatsiaris},
year={2024},
eprint={2412.12322},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2412.12322},
}
GPL