/semantic-prompt-cache

This app leverages Semantic Caching to minimize inference latency and reduce API costs by reusing semantically similar prompt responses.


RAG + Semantic Cache System

This project is designed to enhance a Retrieval-Augmented Generation (RAG) pipeline with a custom-built Semantic Cache system. The primary goal is to reduce redundant LLM (Large Language Model) calls, improve system responsiveness, and optimize cost for real-time and large-scale AI applications.

🚀 Purpose

In traditional RAG pipelines, every user query is processed through document retrieval and LLM generation, even if a semantically similar query was already answered. This approach increases latency and inflates API usage costs.

This system introduces a semantic caching layer that intercepts incoming queries and compares them, based on meaning rather than just keywords, against previously answered queries. If a sufficiently similar query is found, the cached response is reused, bypassing the need for another LLM call.
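
Conceptually, the lookup embeds the incoming query and compares it against the embeddings of previously answered queries by cosine similarity, reusing the stored answer only when the best match clears a threshold. A minimal NumPy sketch of that idea (the function names and the 0.85 threshold here are illustrative, not the project's actual values):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_emb: np.ndarray, cached: list[tuple[np.ndarray, str]],
           threshold: float = 0.85) -> str | None:
    """Return the response of the most similar cached query, or None on a miss."""
    best_score, best_response = -1.0, None
    for emb, response in cached:
        score = cosine_similarity(query_emb, emb)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None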

🔧 Use Cases

  • Chatbots with memory efficiency
    Minimize repeated LLM calls for frequently asked or rephrased questions.

  • Enterprise knowledge assistants
    Provide consistent and faster answers to similar user queries across departments.

  • High-throughput RAG pipelines
    Scale to thousands of queries per day while maintaining performance and reducing cost.

  • Latency-sensitive applications
    Reduce end-user wait time by short-circuiting the full RAG flow when a cached response is available.

Semantic Cache for LLM-Enhanced RAG

A modular, non-OOP semantic caching system built to reduce LLM calls and latency in Retrieval-Augmented Generation (RAG) pipelines.

🔧 Features

  • ✅ Embeds user queries using bge-small-en-v1.5
  • ✅ Stores query-response pairs in a FAISS index (see the sketch after this list)
  • ✅ Retrieves cached results based on semantic similarity
  • ✅ Configurable similarity threshold
  • ✅ Supports metadata (timestamps, hits) and leaderboard extensions
  • ✅ Fully functional with Mistral (via Groq) or any OpenRouter-compatible LLM
  • ✅ Enterprise knowledge assistants (e.g. Azure Docs)
  • ✅ High-throughput RAG pipelines
  • ✅ Latency-sensitive LLM apps
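
A minimal sketch of the FAISS-backed lookup described above. It assumes unit-normalized embeddings (so inner product equals cosine similarity); the 384-dimensional size matches bge-small-en-v1.5, but the threshold, function names, and metadata layout are illustrative rather than the project's actual configuration:

import time
import faiss
import numpy as np

DIM = 384          # bge-small-en-v1.5 produces 384-dimensional embeddings
THRESHOLD = 0.85   # illustrative similarity threshold

index = faiss.IndexFlatIP(DIM)   # inner product == cosine similarity on normalized vectors
entries = []                     # response + metadata, parallel to the index rows

def cache_set(query_emb: np.ndarray, response: str) -> None:
    """Store a normalized query embedding with its response and metadata."""
    index.add(query_emb.reshape(1, -1).astype("float32"))
    entries.append({"response": response, "timestamp": time.time(), "hits": 0})

def cache_get(query_emb: np.ndarray) -> str | None:
    """Return the cached response of the most similar query if it clears the threshold."""
    if index.ntotal == 0:
        return None
    scores, ids = index.search(query_emb.reshape(1, -1).astype("float32"), 1)
    if scores[0][0] < THRESHOLD:
        return None
    entry = entries[ids[0][0]]
    entry["hits"] += 1           # hit counts can feed the leaderboard extension
    return entry["response"]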

🧱 Architecture Overview

            ┌────────────────────────────┐
            │      User Query Input      │
            └────────────────────────────┘
                         │
                         ▼
     ┌───────────────────────────────────┐
     │ 1. Check Semantic Cache (FAISS)   │
     └───────────────────────────────────┘
         │ Yes (high match)   │ No (miss)
         ▼                    ▼
  Reuse Cached LLM     ┌─────────────────────┐
      Response         │ 2. Retrieve Context │
                       └─────────────────────┘
                               │
                               ▼
         ┌────────────────────────────────┐
         │ 3. Build Prompt + Inject Docs  │
         └────────────────────────────────┘
                               │
                               ▼
        ┌────────────────────────────────────┐
        │ 4. Generate Response (Mistral LLM) │
        └────────────────────────────────────┘
                               │
                               ▼
        ┌────────────────────────────────────┐
        │ 5. Postprocess + Store in Cache    │
        └────────────────────────────────────┘
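
In code, these five steps translate into a cache-first control flow along the lines of the sketch below. The imports match the usage example further down; retrieve_context, build_prompt, and call_llm are stand-in stubs for the rag/ modules listed in the next section:

from semantic_cache.operations import get_from_cache, set_in_cache

# Stand-in stubs: the real pipeline wires in rag/retriever.py, rag/prompt_builder.py
# and rag/llm_client.py here.
def retrieve_context(query: str) -> list[str]:
    return ["<retrieved chunk>"]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(docs)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    return "<LLM answer>"

def answer(query: str) -> str:
    """Cache-first RAG flow: reuse a semantically similar answer when available."""
    cached = get_from_cache(query)        # 1. check the semantic cache
    if cached:
        return cached                     #    high-similarity hit: skip retrieval and the LLM
    docs = retrieve_context(query)        # 2. top-k document retrieval
    prompt = build_prompt(query, docs)    # 3. build prompt + inject docs
    response = call_llm(prompt)           # 4. generate the response
    set_in_cache(query, response)         # 5. postprocess + store for future similar queries
    return response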

📁 Key Modules

Module                            Purpose
semantic_cache/embedder.py        Loads the BGE model and returns query embeddings
semantic_cache/index_manager.py   Manages FAISS index creation, loading, and saving
semantic_cache/operations.py      Handles get/set/clear cache operations
rag/retriever.py                  Top-k document retrieval from the Azure knowledge base
rag/prompt_builder.py             Combines retrieved chunks + user question into the LLM prompt
rag/llm_client.py                 Calls Mistral via Groq using LangChain
rag/ingest_docs.py                Preprocesses and uploads local docs into the FAISS vectorstore
tests/                            Unit tests for all core functionality
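
For reference, loading bge-small-en-v1.5 with sentence-transformers and returning a normalized query embedding could look like the sketch below; the actual embedder.py may differ in how the model is loaded and cached:

import numpy as np
from sentence_transformers import SentenceTransformer

# BAAI/bge-small-en-v1.5 from the Hugging Face hub, loaded once per process
_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def embed(query: str) -> np.ndarray:
    """Return a unit-normalized embedding so inner product equals cosine similarity."""
    vec = _model.encode(query, normalize_embeddings=True)
    return np.asarray(vec, dtype="float32")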

🚀 Usage (Example)

from semantic_cache.operations import get_from_cache, set_in_cache

query = "top places to visit in France"
cached = get_from_cache(query)

if cached:
    print("✅ Cache Hit:", cached)
else:
    response = "Paris, Lyon, Nice..."  # placeholder: in the full pipeline this comes from the RAG + LLM flow
    set_in_cache(query, response)

Run Tests

pytest tests/
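
A minimal example of the kind of test that belongs in tests/, written against the get_from_cache / set_in_cache API from the usage example (the real test layout and assertions may differ):

from semantic_cache.operations import get_from_cache, set_in_cache

def test_set_then_get_returns_cached_response():
    # Storing a response should make it retrievable for the same query.
    set_in_cache("top places to visit in France", "Paris, Lyon, Nice...")
    assert get_from_cache("top places to visit in France") == "Paris, Lyon, Nice..."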

📌 Next Steps

🔁 Add leaderboard and TTL/size-based cache trimming

📚 Ingest Azure PDF documentation automatically

🌐 Wrap with FastAPI for API serving

☁️ Upgrade from FAISS → Qdrant/Chroma

🤖 Migrate from Groq to AI Foundry (multi-LLM orchestration)