This is a Retrieval Augmented Generation application with a customizable Gradio Chat app. It lets you:
- Embed your documents into a vector database running locally.
- Use models like Llama 2 70B or Mixtral 8x7B in the cloud via NVIDIA inference endpoints.
- Run quantized versions of Mistral 7B and Llama 2 7B locally on a GPU with 12 GB of vRAM or more.
- Use your own self-hosted microservice to run different models via NVIDIA NeMo Inference Microservices (NIMs).
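For context, the local document-embedding flow is conceptually similar to the sketch below. This is only an illustration using LangChain and FAISS; the project ships its own chain server and vector database, so the library choices, model name, and file path here are assumptions, not the project's actual code.

```python
# Minimal sketch of embedding local documents into a vector store (illustrative only;
# the project's chain server and vector database differ).
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = TextLoader("my_notes.txt").load()  # hypothetical input file
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)

# Embed the chunks locally and index them for similarity search.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# A RAG query then retrieves the most relevant chunks to ground the model's answer.
for doc in vectorstore.similarity_search("What do my documents say about deployment?", k=4):
    print(doc.page_content[:120])
```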
This is how to use this project to run RAG with inference via NVIDIA cloud endpoints. If you get stuck, see the "Troubleshooting" section.
- You need an NGC account to get an NVCF run key. Create one here.
- You need an NVCF run key to access the NVIDIA endpoints. Once you have an NGC account, create a run key here.
- You need a Hugging Face API token. See how to create one here.
- Install and configure AI Workbench locally, then open it and select a location of your choice.
- Fork this repo into your own GitHub account.
- In AI Workbench:
  - Clone the forked repo with the URL. Hint: Click Clone and enter the repo URL.
  - The repo will clone and Workbench will build the container, which can take between 10 and 20 minutes.
  - After the container builds, open the Chat app. Hint: Click the green button at the top right.
  - When prompted, enter your Hugging Face token and NVIDIA NVCF run key as secrets.
  - Open the Chat app again, and the Gradio app will open in a browser. This takes around 30 seconds.
- In the Gradio Chat app:
  - Select the Cloud option and submit a query. The first query triggers a backend build, which takes a minute.
  - To perform RAG, select Upload Documents Here from the right-hand panel of the chat UI.
  - You may see a warning that the vector database is not ready yet. If so, wait a moment and try again.
  - When the database starts, select Update Database and choose the text files to upload.
  - Once the files upload, the Toggle to Use Vector Database next to the text input box will turn on.
  - Now query your documents! What are they telling you?
  - To change the endpoint, select a different model from the right-hand dropdown and continue querying.
For the other supported inference modes, check out the "Advanced Tutorials" section below.
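Under the hood, a cloud-endpoint query boils down to something like the following standalone sketch. It assumes the langchain-nvidia-ai-endpoints package and an API key exported as NVIDIA_API_KEY; the model ID is just an example, and the project wires all of this up for you inside the container.

```python
# Illustrative only: querying an NVIDIA-hosted inference endpoint directly.
# Assumes `pip install langchain-nvidia-ai-endpoints` and NVIDIA_API_KEY in the environment.
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")  # example model ID
response = llm.invoke("Summarize retrieval augmented generation in one sentence.")
print(response.content)
```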
Note: NVIDIA AI Workbench is the easiest way to get this RAG app running.
- NVIDIA AI Workbench is a free client application that you can install on your own machines.
- It provides portable and reproducible dev environments by handling Git repos and containers for you.
- See how to install it here for Windows, for Ubuntu 22.04, and for macOS 12 or higher.
- Make sure you installed AI Workbench. There should be a desktop icon on your system. Double-click it to start Workbench.
- Make sure you opened Workbench.
- Click on the Local location.
- If this is your first project, click the green Clone Existing Project button. Otherwise, click Clone Project in the top right.
- Drop in the repo URL, leave the default path, and click Clone.
- The container will build, which can take between 10 and 20 minutes.
- Look at the very bottom right of the Workbench window; you will see a Build Status widget.
- Click it to expand the build output.
- When the container is built, the widget will say Build Ready.
- Now you can begin.
- Check that the container finished building.
- When it finishes, click the green Open Chat button at the top right.
- Check that the container is built.
- Then click the green dropdown next to the Open Chat button at the top right.
- Select JupyterLab to start editing the code.
This section shows you how to use different inference modes with this RAG project. To do these tutorials, you need a GPU with at least 12 GB of vRAM. If you don't have one, go back to the Quickstart Tutorial, which shows how to use Cloud Endpoints.
This tutorial assumes you already cloned this Hybrid RAG project to your AI Workbench. If not, please follow the beginning of the Quickstart Tutorial.
Inference
- Open the Chat app from the AI Workbench project window. Hint: It's the big green button at the top right.
- You may be prompted to enter your NVCF and Hugging Face keys as project secrets. If so, enter them and then select Open Chat again.
- If you aren't prompted to enter the keys, you entered them previously. Find them in AI Workbench under Environment → Secrets.
- Once the UI opens, select the Local System inference mode under Inference Settings → Inference Mode. Wait for the RAG backend to start. It may take a minute.
- Select a model from the dropdown on the right hand settings panel. Mistral 7B and Llama 2 are currently supported.
- Mistral 7B: This model is ungated and is easiest to use.
- Llama 2: This model is gated. Ensure the Hugging Face API token is configured properly. You can edit it under Environment → Secrets → HUGGING_FACE_HUB_TOKEN and restart the environment if needed.
- You can also enter a custom model from Hugging Face as text, following the same format. Be careful: not all models and quantization levels are supported in this RAG!
- Select a quantization level. Full, 8-bit, and 4-bit precision levels are currently supported.
| vRAM | System RAM | Disk Storage | Model Size & Quantization |
|---------|------------|--------------|---------------------------|
| >=12 GB | 32 GB | 40 GB | 7B & int4 |
| >=24 GB | 64 GB | 40 GB | 7B & int8 |
| >=40 GB | 64 GB | 40 GB | 7B & none |
- Select Load Model to pre-fetch the model. Timing can vary between a few minutes and 20 minutes, based on your network.
- Select Start Server to start the inference server with your current local GPU. This may take a moment to warm up.
- Now, start chatting! Queries will be made to the model running on your local system whenever this inference mode is selected.
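For reference, the precision levels in the table above correspond to quantized model loading roughly like the sketch below. This uses Hugging Face transformers with bitsandbytes as an illustration; it is not the project's actual loader, and the model ID is just an example.

```python
# Illustrative only: loading a 7B model in 4-bit (int4) precision so it fits in ~12 GB of vRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # example ungated model
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # use load_in_8bit=True for the int8 row
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)
```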
Using RAG
- In the right-hand panel of the Chat UI, select Upload Documents Here → Update Database and choose the text files to upload.
- Once the files upload, the Toggle to Use Vector Database next to the text input box will turn on by default.
- Now query your documents! To use a different model, stop the server, make your selections, and restart the inference server.
This tutorial assumes you already cloned this Hybrid RAG project to your AI Workbench. If not, please follow the beginning of the Quickstart Tutorial.
Prerequisites
- Set up your NVIDIA NeMo Inference Microservice to run on another system of your choice. After joining the EA Program, the playbook to get started is located here.
Inference
- Open the Chat application from the AI Workbench project window.
- You may be prompted to enter your NVCF and Hugging Face keys as project secrets. Do that and then select Open Chat again.
- If you aren't prompted, you already entered the keys. See them in AI Workbench under Environment→Secrets.
- Once the UI opens, select the Self-hosted Microservice inference mode under Inference Settings → Inference Mode. Wait for the RAG backend to start up, which may take a few moments.
- Select the Remote tab in the right-hand settings panel. Enter the IP address of the system running the microservice, as well as the name of the model being served by it.
- Now start chatting! Queries will be made to the microservice running on a remote system whenever this inference mode is selected.
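If you want to sanity-check the remote microservice outside the chat UI, and it exposes an OpenAI-compatible API (common for NIM deployments), a direct query looks roughly like the sketch below. The IP address, port, and model name are placeholders, and your microservice's API may differ.

```python
# Illustrative only: querying a remote, OpenAI-compatible microservice directly.
# The IP, port, and model name are placeholders; the chat UI normally does this for you.
from openai import OpenAI

client = OpenAI(
    base_url="http://10.0.0.5:9999/v1",    # placeholder <microservice-ip>:<port>
    api_key="not-needed-for-self-hosted",  # many self-hosted deployments ignore the key
)

completion = client.chat.completions.create(
    model="mistral-7b-instruct",           # placeholder model name
    messages=[{"role": "user", "content": "Hello from AI Workbench!"}],
)
print(completion.choices[0].message.content)
```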
Using RAG
- To perform RAG, in the right hand panel of the Chat UI select Upload Documents Here →Update Database and choose the text files to upload.
- Once uploaded successfully, the Toggle to Use Vector Database should turn on by default next to your text input box.
- Now you may query your documents!
Don't try this section unless you have some Docker experience; if you do, it should be fairly straightforward.
Spinning up a Microservice locally from inside the AI Workbench Hybrid RAG project is an area of active development. This tutorial has been tested on 1x RTX 4090 and is currently being improved.
Here are some important PREREQUISITES:
- This tutorial assumes you already have this Hybrid RAG project cloned to your AI Workbench. If not, please first follow steps 1-5 of the project Quickstart.
- Your AI Workbench must be running with a DOCKER container runtime. Podman is currently unsupported.
- You must already be accepted into the NeMo Inference Microservice EA Program.
- You must have generated your own TRT-LLM model engine files in some model store directory located on your local system. These are models you would like to serve for inference.
- Shut down any locally running inference servers (e.g. from Tutorial 1), as these may cause memory issues when running the microservice locally.
Inference
- In the AI Workbench project window, navigate to Environment → Mounts → Add. Add the following host mount:
- Type: Host Mount
- Target: /opt/host-run
- Source: /var/run
- Description: Mount for Docker socket (NIM on Local RTX)
- Navigate to Environment→Secrets. Configure the existing secrets and create a new secret with the following details.
- Name: NGC_CLI_API_KEY
- Value: (Your NGC API Key)
- Description: NGC API Key for NIM access
- Navigate to Environment→Variables. Ensure the following are configured. Restart your environment if needed.
- DOCKER_HOST: the location of your Docker socket, e.g. unix:///opt/host-run/docker.sock
- LOCAL_NIM_MODEL_STORE: the location of your model-store directory, e.g. /mnt/c/Users/NVIDIA/model-store
- Open the Chat application from the AI Workbench project window.
- You may be prompted to enter your NVCF and Hugging Face keys as project secrets. If so, enter them and then select Open Chat again.
- If you aren't prompted, you already entered the keys. You can find them in AI Workbench under Environment → Secrets.
- Once the UI opens, select the Self-hosted Microservice inference mode under Inference Settings → Inference Mode. Wait for the RAG backend to start up, which may take a few moments.
- Select the Local (RTX) tab in the right hand settings panel. Input the model name of your TRT-LLM engine file. Select Start Microservice Locally. This may take a few moments to complete.
- Now, you can start chatting! Queries will be made to your microservice running on the local system whenever this inference mode is selected.
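If the local microservice fails to start, a quick way to confirm the Docker socket wiring from the mount and DOCKER_HOST variable configured above is a check like the one below. It assumes the docker Python package is available in the environment; running docker ps from a terminal works just as well.

```python
# Sanity check (illustrative): confirm the container can reach the host's Docker socket
# through the /opt/host-run mount and the DOCKER_HOST variable configured above.
import os
import docker  # assumes the `docker` Python package is installed

socket_url = os.environ.get("DOCKER_HOST", "unix:///opt/host-run/docker.sock")
client = docker.DockerClient(base_url=socket_url)

print("Docker reachable:", client.ping())
for container in client.containers.list():
    print(container.name, container.status)
```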
Using RAG
- To perform RAG, select Upload Documents Here from the right hand panel of the chat UI. Select Update Database and choose the text files to upload.
- Once uploaded successfully, the Toggle to Use Vector Database should turn on by default next to your text input box.
- Now you may query your documents!
- In AI Workbench, open JupyterLab. Hint: It's in the dropdown for the green button at the top right.
- Go into the code/chatui/ folder and start editing the files.
- Save the files.
- To see your changes, stop the Chat UI and restart it.
- To version your changes, commit them in the Workbench project window and push to your GitHub repo.
In addition to modifying the Gradio frontend, you can also use JupyterLab to customize other aspects of the project, e.g. custom chains, the backend server, scripts, etc.
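If you're new to Gradio, a standalone stub like the one below is a handy way to experiment before editing the real files. It is not the project's chatui code, just a minimal example of the same building blocks.

```python
# Standalone Gradio chat stub for experimenting with UI changes (not the project's chatui code).
import gradio as gr

def respond(message, history):
    # Placeholder logic; the real app routes the message to the selected inference backend.
    return f"Echo: {message}"

demo = gr.ChatInterface(fn=respond, title="My Custom RAG Chat")
demo.launch()
```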
This NVIDIA AI Workbench example project is released under the Apache 2.0 License.
This project may download and install additional third-party open source software projects. Review the license terms of these open source projects before use. Third party components used as part of this project are subject to their separate legal notices or terms that accompany the components. You are responsible for confirming compliance with third-party component license terms and requirements.