This repo aims to be an easy-to-follow showcase of how to implement RAG (Retrieval Augmented Generation) with only local models (no OpenAI API keys needed). The example leverages MongoDB Atlas Search as the vector store, Hugging Face transformers to compute the embeddings, LangChain as the LLM framework, and a quantized version of Llama2-7B from TheBloke.
> **Note:** You are free to experiment with different models; the main goal of this work was to get something that runs entirely on CPU with the least amount of memory possible. Bigger models will give you significantly better results.
In this example we are using the `sample_mflix.movies` collection, part of the sample dataset available in Atlas. Once you have loaded the sample dataset into your Atlas cluster, you can run `encoder.py` to compute vectors from your MongoDB documents:

```bash
python3 encoder.py
```
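For orientation, here is a minimal sketch of what an encoding step like this might look like. The model name `sentence-transformers/all-MiniLM-L6-v2` (chosen because it outputs 384-dimensional embeddings, matching the index below) and the `plot`/`vector` field names are assumptions for illustration, not necessarily what `encoder.py` does:

```python
# Sketch: embed each movie plot and store it in a "vector" field.
# The model, database/collection names, and field names are assumptions.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

from config import mongo_uri

# all-MiniLM-L6-v2 produces 384-dimensional embeddings, matching the
# "dimensions": 384 in the search index defined below.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

client = MongoClient(mongo_uri)
collection = client["sample_mflix"]["movies"]

for doc in collection.find({"plot": {"$exists": True}}):
    embedding = model.encode(doc["plot"]).tolist()
    collection.update_one({"_id": doc["_id"]}, {"$set": {"vector": embedding}})
```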
Download the model from Hugging Face by following this documentation.
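If you prefer to script the download, something along these lines should work with `huggingface_hub`. The `repo_id` and `filename` below are assumptions (TheBloke publishes many quantizations); adjust them to the variant referenced in the documentation:

```python
# Sketch: download a quantized Llama2-7B file from the Hugging Face Hub.
# repo_id and filename are hypothetical choices; pick the quantization
# you actually want from TheBloke's model card.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",  # assumption
    filename="llama-2-7b-chat.Q4_K_M.gguf",   # assumption
)
print(model_path)
```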
Create the following search index on the `sample_mflix.movies` collection:
```json
{
  "mappings": {
    "fields": {
      "vector": [
        {
          "dimensions": 384,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ]
    }
  }
}
```
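With the index in place, you can sanity-check it with a raw `$search` aggregation using the `knnBeta` operator (the operator Atlas Search pairs with `knnVector` indexes). This is a hedged sketch: the index name `default` and the embedding model are assumptions carried over from above:

```python
# Sketch: nearest-neighbour lookup against the "vector" field via $search.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

from config import mongo_uri

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
collection = MongoClient(mongo_uri)["sample_mflix"]["movies"]

query_vector = model.encode("movies about racing").tolist()
results = collection.aggregate([
    {"$search": {
        "index": "default",  # assumption: index created with the default name
        "knnBeta": {"vector": query_vector, "path": "vector", "k": 5},
    }},
    {"$project": {"title": 1, "plot": 1, "_id": 0}},
])
for doc in results:
    print(doc["title"])
```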
In the `config.py` file, add a field like so:

```python
mongo_uri = "mongodb+srv://<your conn string>"
```
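Before launching the app, a quick connectivity check can save some debugging; this is a minimal snippet for verification only, not part of the repo:

```python
# Sketch: verify the Atlas connection string before running the app.
from pymongo import MongoClient

from config import mongo_uri

client = MongoClient(mongo_uri)
client.admin.command("ping")  # raises if the URI or network access is wrong
print("Connected to Atlas")
```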
Run the Streamlit app with:

```bash
streamlit run main.py
```
We can then ask the LLM something like:

> give me some movies about racing
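Under the hood, a retrieval chain along these lines ties the pieces together. This is a sketch using LangChain's `LlamaCpp` and `MongoDBAtlasVectorSearch` wrappers under the classic `langchain` package layout; the model path, index name, and field keys are assumptions, and `main.py` may wire things up differently:

```python
# Sketch: a minimal RAG chain connecting the vector store to the local LLM.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import LlamaCpp
from langchain.vectorstores import MongoDBAtlasVectorSearch
from pymongo import MongoClient

from config import mongo_uri

collection = MongoClient(mongo_uri)["sample_mflix"]["movies"]
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# index_name, text_key, and embedding_key are assumptions matching the
# search index and encoder sketch above.
vector_store = MongoDBAtlasVectorSearch(
    collection,
    embeddings,
    index_name="default",
    text_key="plot",
    embedding_key="vector",
)

llm = LlamaCpp(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=2048,
)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vector_store.as_retriever())

print(qa.run("give me some movies about racing"))
```

The retriever embeds the question, pulls the nearest movie documents from Atlas, and passes them as context to the quantized Llama2 model, which is what lets the whole pipeline run locally on CPU.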