RAG (Retrieval Augmented Generation) with local models

This repo is meant to be an easy way to showcase how to implement RAG (Retrieval Augmented Generation) with only local models (no OpenAI API keys needed). For this example we leverage MongoDB Atlas Search as the vector store, Hugging Face transformers to compute the embeddings, LangChain as the LLM framework, and a quantized version of Llama2-7B from TheBloke.

Note: You are free to experiment with different models; the main goal of this work was to get something that can run entirely on CPU with the least amount of memory possible. Bigger models will give you considerably better results.

Load sample data

In this example we are using the sample_mflix.movies collection, part of the sample dataset available in Atlas. Once you have loaded the sample dataset into your Atlas cluster, you can run encoder.py to compute vectors from your MongoDB documents.

python3 encoder.py
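
For reference, here is a minimal sketch of what encoder.py does conceptually. It assumes the sentence-transformers all-MiniLM-L6-v2 model (384-dimensional vectors, matching the index defined below), that the movie plot field is what gets embedded, and a config.py exposing mongo_uri as described later in this README; the field names are illustrative and may differ from the actual repo code:

from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

from config import mongo_uri

# all-MiniLM-L6-v2 produces 384-dimensional embeddings,
# which matches the "dimensions" value in the search index below.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

client = MongoClient(mongo_uri)
collection = client["sample_mflix"]["movies"]

# Embed each movie plot and store the vector alongside the document.
for doc in collection.find({"plot": {"$exists": True}}):
    embedding = model.encode(doc["plot"]).tolist()
    collection.update_one({"_id": doc["_id"]}, {"$set": {"vector": embedding}})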

Download the quantized Llama2 LLM locally from Hugging Face

Download the model from Hugging Face by following this documentation.
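
One way to fetch a quantized build of Llama2-7B programmatically is via the huggingface_hub library. The repo id, file name, and target directory below are only examples; adjust them to the exact model and quantization level you want to run:

from huggingface_hub import hf_hub_download

# Example repo/file: a 4-bit quantized Llama2-7B chat model in GGUF format.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="models",
)
print(model_path)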

Define Atlas Search Index

Create the following search index on the sample_mflix.movies collection:

{
  "mappings": {
    "fields": {
      "vector": [
        {
          "dimensions": 384,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ]
    }
  }
}
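
If you prefer not to use the Atlas UI, the same index can in principle be created programmatically. This sketch assumes pymongo >= 4.5 (which added create_search_index) and an Atlas tier that allows managing search indexes from the driver:

from pymongo import MongoClient

from config import mongo_uri

client = MongoClient(mongo_uri)
collection = client["sample_mflix"]["movies"]

# Same definition as the JSON above; "default" is the index name
# assumed by the query example further down.
collection.create_search_index(
    {
        "name": "default",
        "definition": {
            "mappings": {
                "fields": {
                    "vector": [
                        {
                            "dimensions": 384,
                            "similarity": "cosine",
                            "type": "knnVector",
                        }
                    ]
                }
            }
        },
    }
)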

Add a config.py file

In the config.py file, add your Atlas connection string like so:

mongo_uri = "mongodb+srv://<your conn string>"

Run the Streamlit app to query the LLM

Run the Streamlit app with:

streamlit run main.py

We can then ask the LLM something like:

give me some movies about racing
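
For orientation, here is a stripped-down sketch of what a main.py along these lines could look like. It assumes LangChain's community integrations (langchain, langchain-community, llama-cpp-python), the same all-MiniLM-L6-v2 embedding model used during encoding, a plot text field, a vector embedding field, a search index named default, and a hypothetical models/llama-2-7b-chat.Q4_K_M.gguf path; none of these names are guaranteed to match the actual repo code:

import streamlit as st
from pymongo import MongoClient
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import MongoDBAtlasVectorSearch

from config import mongo_uri

collection = MongoClient(mongo_uri)["sample_mflix"]["movies"]

# Vector store backed by the Atlas Search index defined above.
vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
    index_name="default",
    text_key="plot",        # field whose text is passed to the LLM as context
    embedding_key="vector",  # field holding the 384-dimensional embeddings
)

# Local, CPU-only LLM loaded from the quantized file downloaded earlier.
llm = LlamaCpp(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
    n_ctx=2048,
    temperature=0.1,
)

# Retrieve the top 3 most similar plots and let the LLM answer from them.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)

st.title("RAG with local models")
question = st.text_input("Ask something about movies")
if question:
    st.write(qa.invoke(question)["result"])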