RAG using HyDE with completely open-source, self-hostable models.
TLDR; check folder 7 for the final app.
- Vector DB - Milvus (Dockerized)
- Milvus GUI - Attu (Dockerized)
- Embedding model - MixedBread Embed Large - chosen because it is one of the top models on the embedding leaderboard
- LLM serving - vLLM
- ONNX Runtime GPU - optional; speeds up embeddings
- RAG framework - LlamaIndex
- UI - Streamlit
- OpenAI API - optional; useful for isolating bugs from the self-hosted models
The major components of this system will be the VectorDB, LLM inference, and the embedding model. The Streamlit application is quite light in comparison.
By choosing scalable bases for the major components, the entire setup will be inherently scalable.
- Milvus: Supports running on a Kubernetes (k8s) cluster, allowing us to set scaling rules so the DB will scale according to the load.
- vLLM: This is by far the fastest LLM serving engine I've used. We can scale it to use any number of GPUs within a single node, and we can set up multi-node inference using a Ray cluster. It's also possible to set up k8s load-based scaling, provided that we define the resources properly.
- Embedding: In this case, I've used local serving for the embedding. Ideally, I would opt for serving the embedding model separately and scaling it independently.
Put together, the system design would look something like this:
Install the required dependencies:
pip install -r requirements.txt
Set up Milvus (this will be our vector DB):
Milvus runs as a Docker container; the Milvus docs provide an installation script:
# Download the installation script
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
# Start the Docker container
bash standalone_embed.sh start
Make sure to download the dataset from Kaggle!
Check the folder 1. Milvus Setup Testing. It contains the default "Hello World" template for Milvus. The code assumes you used the install script above, which runs Milvus on port 19530.
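Separately from that template, a quick connectivity check with pymilvus can confirm the container is up. This is just a sketch; the host and port assume the defaults from the install script:

```python
from pymilvus import MilvusClient

# Connect to the standalone Milvus instance started by the install script
client = MilvusClient(uri="http://localhost:19530")

# If the connection works, this prints the existing collections (empty on a fresh install)
print(client.list_collections())
```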
We are using the FastEmbed backend for the embedding model. Let's test the model; the first test run also downloads the model, which saves us time in later stages.
The folder 2. Embedding Setup Testing contains a script - test.py that downloads the model, loads it, and encodes a few sample sentences.
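A minimal version of that test might look like the following; the exact FastEmbed model identifier is an assumption, so check test.py for the one actually used:

```python
from fastembed import TextEmbedding

# The first run downloads the model; later runs load it from the local cache
model = TextEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1")

sentences = [
    "HyDE generates a hypothetical answer before retrieval.",
    "Milvus stores the embedding vectors.",
]

# embed() returns a generator of numpy arrays, one vector per sentence
embeddings = list(model.embed(sentences))
print(len(embeddings), embeddings[0].shape)
```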
We use the template code from the LlamaIndex docs to check that our LlamaIndex installation works as intended. For testing purposes, I used an OpenAI key at this stage to minimize bugs. This will change in later stages.
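That starter template is roughly the following; it reads the OpenAI key from the environment and indexes a local data/ folder:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Uses OPENAI_API_KEY from the environment for both embeddings and the LLM
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What is this document about?"))
```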
The dataset is a single CSV file. I've converted each row (which represents each episode) into its own JSON file (see data_splitter.py). This makes organizing much easier. Another advantage of this approach is that the file name is automatically included in the metadata of each entry in the VectorDB, making it easier to organize the DB without parsing other columns in the CSV to generate the metadata.
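A sketch of that conversion is shown below; the CSV file name, column handling, and output paths are placeholders, and data_splitter.py is the source of truth:

```python
import json
from pathlib import Path

import pandas as pd

df = pd.read_csv("episodes.csv")  # placeholder file name
out_dir = Path("data/episodes_json")
out_dir.mkdir(parents=True, exist_ok=True)

# One JSON file per row (i.e., per episode); the file name later shows up
# in the metadata of the corresponding entries in the VectorDB
for idx, row in df.iterrows():
    out_path = out_dir / f"episode_{idx}.json"
    out_path.write_text(json.dumps(row.to_dict(), ensure_ascii=False, indent=2))
```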
Now, I can ingest the data into the VectorDB. Please note that this step is a one-time process, and using a GPU at this stage is recommended. CPU-only ingestion takes a very long time. On my laptop (with RTX 3060), it took ~5 minutes. On CPU, the estimated time was ~4 hours (i7, 8C 16T CPU). The script to ingest the VectorDB from the generated JSON files can be found here - ingest.py.
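Conceptually, the ingestion just points LlamaIndex at the JSON folder, the FastEmbed embedding model, and the Milvus vector store. A simplified sketch follows; the collection name, dimension, and paths are assumptions, and ingest.py remains the source of truth:

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore

embed_model = FastEmbedEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1")

# 1024 is the output dimension of the MixedBread large embedding model
vector_store = MilvusVectorStore(
    uri="http://localhost:19530",
    collection_name="episodes",
    dim=1024,
    overwrite=True,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Each JSON file becomes a document, with its file name carried in the metadata
documents = SimpleDirectoryReader("data/episodes_json").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)
```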
If you installed Attu, you should be able to see this data there.
At this stage, we are going to put together our VectorDB and embedding model with LlamaIndex and test if we can do a retrieval with it. This script loads up the embedding model and VectorDB. If you get an output at this stage, congrats! The setup is done! Kinda...
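In essence, the script reconnects to the populated Milvus collection and runs a retrieval against it, along these lines (collection name and embedding model reused from the ingestion sketch above):

```python
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore

embed_model = FastEmbedEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1")
vector_store = MilvusVectorStore(
    uri="http://localhost:19530",
    collection_name="episodes",
    dim=1024,
)

# Rebuild the index object on top of the already-populated collection
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)

retriever = index.as_retriever(similarity_top_k=3)
for result in retriever.retrieve("Which episode talks about time travel?"):
    print(result.score, result.node.metadata.get("file_name"))
```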
5. Sample Inference LlamaIndex VectorDB
This uses the Streamlit template to have a multi-turn chat with our VectorDB as the chat engine.
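The core of that app follows the standard Streamlit chat pattern: the index is wrapped in a chat engine, and the conversation lives in st.session_state. A trimmed-down sketch (the chat mode is an assumption, and the LLM behind the chat engine is still the default OpenAI model at this stage):

```python
import streamlit as st
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore


@st.cache_resource
def get_chat_engine():
    # Same index setup as in the retrieval test; the chat engine adds multi-turn memory
    embed_model = FastEmbedEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1")
    vector_store = MilvusVectorStore(uri="http://localhost:19530", collection_name="episodes", dim=1024)
    index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
    return index.as_chat_engine(chat_mode="condense_question")


chat_engine = get_chat_engine()

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

if prompt := st.chat_input("Ask about an episode"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    reply = str(chat_engine.chat(prompt))
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.write(reply)
```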
I've hosted Llama 3 8B on an A100 instance using vLLM. The model runs with an FP8 KV cache, which enables high throughput with just ~20 GB of VRAM (allowing it to run on cheaper GPUs like the L4). I've put a reverse proxy in front of it on one of my subdomains: https://llm.parasu.in/v1/
Here, we use the OpenAILike class from LlamaIndex, which can load models served behind OpenAI-compatible endpoints. The base URL, API key, and model name are taken as inputs in the Streamlit app itself. At this point, the Milvus URL is hardcoded, but it can be replaced with an environment variable when needed. The code is well-commented and should be self-explanatory.
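Wiring the vLLM endpoint into LlamaIndex looks roughly like this; the model name must match whatever name vLLM was launched with, and the API key only matters if the server was started with one:

```python
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="https://llm.parasu.in/v1",          # the OpenAI-compatible vLLM endpoint
    api_key="unused",                             # placeholder; only checked if the server was given a key
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumption: the model name served by vLLM
    is_chat_model=True,                           # use the chat completions endpoint
)

print(llm.complete("Say hello in one short sentence."))
```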
Given a query, it outputs the answer and the hypothetical document it generated, along with the sources used for retrieval.
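For reference, the HyDE piece in LlamaIndex is a query transform wrapped around the base query engine. A minimal end-to-end sketch, reusing the assumptions from the earlier snippets:

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.vector_stores.milvus import MilvusVectorStore

# Same building blocks as before: FastEmbed embeddings, Milvus store, vLLM-served LLM
Settings.embed_model = FastEmbedEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1")
Settings.llm = OpenAILike(
    api_base="https://llm.parasu.in/v1",
    api_key="unused",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    is_chat_model=True,
)

vector_store = MilvusVectorStore(uri="http://localhost:19530", collection_name="episodes", dim=1024)
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()

# HyDE: generate a hypothetical answer and embed that instead of the raw query
hyde = HyDEQueryTransform(include_original=True)
hyde_engine = TransformQueryEngine(query_engine, query_transform=hyde)

query = "Which episode features a haunted lighthouse?"

# Inspect the hypothetical document produced by the transform
query_bundle = hyde(query)
print("Hypothetical document:", query_bundle.embedding_strs[0])

response = hyde_engine.query(query)
print("Answer:", response)
for source in response.source_nodes:
    print("Source:", source.node.metadata.get("file_name"), source.score)
```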