A RAG (Retrieval-Augmented Generation) service that provides intelligent Q&A capabilities for company handbooks using a local LLM with CUDA/MPS support.
- PDF document processing and embedding
- Local LLM support with GGUF models
- CUDA and MPS (Apple Silicon) support
- Vector similarity search using Qdrant
- Async task queue for document processing
- REST API with OpenAI-compatible endpoints
- Bearer token authentication
- Streaming response support
- Aliyun LLM fallback support
- Python 3.12+
- Poetry for dependency management
- CUDA toolkit (for NVIDIA GPUs) or MPS (for Apple Silicon)
- Qdrant vector database
- Local GGUF model files
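If you are not sure which device to target, the DEVICE setting in the configuration below accepts cuda, mps, or cpu. The following stand-alone helper is an illustration only (it is not part of the project) and makes a best-effort guess:

```python
# Illustrative helper (not part of the project) for picking the DEVICE value used in .env.
import platform
import shutil
import sys


def pick_device() -> str:
    """Best-effort guess: prefer CUDA, then Apple Silicon MPS, otherwise CPU."""
    if shutil.which("nvidia-smi"):  # NVIDIA driver/toolkit present
        return "cuda"
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "mps"  # Apple Silicon
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```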
- Clone the repository:
git clone <repository-url>
cd handbook-rag
- Install dependencies using Poetry:
poetry install
- For CUDA support, install llama-cpp-python with CUDA:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" poetry run pip install llama-cpp-python
Or, for MPS support on macOS:
CMAKE_ARGS="-DLLAMA_METAL=on" poetry run pip install llama-cpp-python
- Set up Qdrant:
docker run -p 6333:6333 -v $(pwd)/qdrant_data:/qdrant/storage qdrant/qdrant
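To confirm the Qdrant container is reachable before wiring it into the service, a minimal check with the official Python client (assuming the default host and port above) looks like this:

```python
# Optional connectivity check (requires `pip install qdrant-client`).
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
print(client.get_collections())  # a fresh instance returns an empty collection list
```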
Create a .env file in the project root:
# API Settings
API_TOKEN=your-secret-token-here
HOST=0.0.0.0
PORT=8000
# Device Settings
DEVICE=cuda # or mps, cpu
MAX_GPU_MEMORY=4GiB
# Model Settings
MODEL_DIR=./models
LOCAL_MODEL_PATH=./models/qwen2.5-7b-instruct
MODEL_PARTS_PATTERN=qwen2.5-7b-instruct-q4_0-{:05d}-of-{:05d}.gguf
MODEL_PARTS_COUNT=2
# Vector DB Settings
QDRANT_HOST=localhost
QDRANT_PORT=6333
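The service reads these values at startup. Purely as an illustration of how the .env variables map to Python (this is not the project's actual settings loader), they could be inspected like this:

```python
# Illustration only: the service loads its own settings; this just shows the mapping
# of .env values to Python (requires `pip install python-dotenv`).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

API_TOKEN = os.getenv("API_TOKEN")
DEVICE = os.getenv("DEVICE", "cpu")
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
```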
Once configured, you can use the service directly as a Python library:

import asyncio

from handbook_rag.bootstrap import RAGService


async def main():
    service = RAGService()
    await service.initialize()

    # Process PDF
    await service.ensure_embeddings("path/to/handbook.pdf")

    # Query the handbook
    response = await service.process_query("What is the vacation policy?")
    print(response)

    await service.shutdown()


if __name__ == "__main__":
    asyncio.run(main())

- Start the server:
poetry run python -m handbook_rag.api
- Upload a PDF:
curl -X POST "http://localhost:8000/v1/embed" \
  -H "Authorization: Bearer your-secret-token-here" \
  -F "file=@handbook.pdf"
- Query the handbook:
curl -X POST "http://localhost:8000/query" \
  -H "Authorization: Bearer your-secret-token-here" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the vacation policy?"}'
- Use the OpenAI-compatible endpoint:
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Authorization: Bearer your-secret-token-here" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the vacation policy?"}
    ],
    "stream": true
  }'
handbook-rag/
├── handbook_rag/
│   ├── api/          # REST API implementation
│   ├── embeddings/   # PDF processing and embedding
│   ├── llm/          # LLM implementations
│   └── queue/        # Async task queue
├── tests/            # Test files
├── pyproject.toml    # Project dependencies
└── README.md         # This file
- Format code:
poetry run black .
poetry run isort .
- Run tests:
poetry run pytest
The service is designed to work with GGUF format models. By default, it's configured to use Qwen 2.5 7B Instruct, but you can use any GGUF model by updating the configuration.
The model is available at: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
The default setup uses the qwen2.5-7b-instruct-q4_0*.gguf quantized files.
The service supports split GGUF models. Configure the following in your .env:
- MODEL_PARTS_PATTERN: pattern for the split files (e.g., "model-q4_0-{:05d}-of-{:05d}.gguf")
- MODEL_PARTS_COUNT: total number of parts
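As a worked example with the default pattern and MODEL_PARTS_COUNT=2, the part filenames expand as shown below. The optional huggingface_hub download is a sketch that assumes the repository linked above publishes the parts under exactly these names:

```python
# Expand the configured pattern into the individual part filenames.
pattern = "qwen2.5-7b-instruct-q4_0-{:05d}-of-{:05d}.gguf"
parts_count = 2
filenames = [pattern.format(i, parts_count) for i in range(1, parts_count + 1)]
print(filenames)
# -> ['qwen2.5-7b-instruct-q4_0-00001-of-00002.gguf',
#     'qwen2.5-7b-instruct-q4_0-00002-of-00002.gguf']

# Optional download sketch (requires `pip install huggingface_hub`); assumes the parts
# exist with these exact names in the repository linked above.
from huggingface_hub import hf_hub_download

for name in filenames:
    hf_hub_download(
        repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
        filename=name,
        local_dir="./models",  # adjust so the result matches LOCAL_MODEL_PATH
    )
```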
MIT License
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request