State-of-the-art Generative AI examples that are easy to deploy, test, and extend. All examples run on the high-performance NVIDIA CUDA-X software stack and NVIDIA GPUs.
- Tips for Building a RAG Pipeline with NVIDIA AI LangChain AI Endpoints by Amit Bleiweiss. [Blog, notebook]
- Experimental examples
Generative AI Examples can use models and GPUs from the NVIDIA API Catalog.
Sign up for a free NGC developer account to access:
- GPU-optimized containers used in these examples
- Release notes and developer documentation
A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to an LLM. RAG lets users chat with their data!
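For orientation, here is a minimal sketch of that pattern using the open source LangChain connector described later in this README. It assumes the `langchain-nvidia-ai-endpoints`, `langchain-community`, and `faiss-cpu` packages and an `NVIDIA_API_KEY` environment variable; the model names and sample text are illustrative, not the examples' actual configuration:

```python
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# Embed documents into a vector database (FAISS here for simplicity).
docs = [
    "NVIDIA NIM microservices serve optimized LLM inference.",
    "RAG pipelines ground LLM answers in retrieved documents.",
]
vectorstore = FAISS.from_texts(docs, embedding=NVIDIAEmbeddings(model="NV-Embed-QA"))
retriever = vectorstore.as_retriever()

# At query time, retrieve relevant chunks and pass them to the LLM as context.
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
llm = ChatNVIDIA(model="meta/llama3-70b-instruct")
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("What do RAG pipelines do?"))
```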
The developer RAG examples run on a single VM. The examples demonstrate how to combine NVIDIA GPU acceleration with popular LLM programming frameworks using NVIDIA's open source connectors. The examples are easy to deploy with Docker Compose.
Examples support local and remote inference endpoints. If you have a GPU, you can run inference locally with NVIDIA NIM for LLMs. If you don't have a GPU, you can run inference and embedding remotely with NVIDIA API Catalog endpoints.
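A hedged sketch of the two options with the LangChain connector; the model names and local URL are illustrative, and the local variant assumes a NIM for LLMs container already serving on that port:

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Remote: NVIDIA API Catalog endpoint (requires NVIDIA_API_KEY).
remote_llm = ChatNVIDIA(model="meta/llama3-70b-instruct")

# Local: point the same connector at a NIM for LLMs container
# running on your own GPU (URL and model are illustrative).
local_llm = ChatNVIDIA(
    base_url="http://localhost:8000/v1",
    model="meta/llama3-8b-instruct",
)

print(remote_llm.invoke("Hello!").content)
```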
Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA Endpoints | Triton | Vector Database |
---|---|---|---|---|---|---|---|---|
llama3-70b | snowflake-arctic-embed-l | LangChain | NVIDIA API Catalog endpoints chat bot [code, docs] | No | No | Yes | Yes | Milvus or pgvector |
llama3-8b | snowflake-arctic-embed-l | LlamaIndex | Canonical QA Chatbot [code, docs] | Yes | Yes | No | Yes | Milvus or pgvector |
llama3-70b | snowflake-arctic-embed-l | LangChain | Chat bot with query decomposition agent [code, docs] | No | No | Yes | Yes | Milvus or pgvector |
llama3-70b | ai-embed-qa-4 | LangChain | Minimalistic example: RAG with NVIDIA AI Foundation Models [code, README] | No | No | Yes | Yes | FAISS |
llama3-8b, Deplot, Neva-22b | snowflake-arctic-embed-l | Custom | Chat bot with multimodal data [code, docs] | No | No | Yes | No | Milvus or pgvector |
llama3-70b | none | PandasAI | Chat bot with structured data [code, docs] | No | No | Yes | No | none |
llama3-8b | snowflake-arctic-embed-l | LangChain | Chat bot with multi-turn conversation [code, docs] | No | No | Yes | No | Milvus or pgvector |
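The last row's multi-turn example keeps conversation history across queries. A minimal, hedged sketch of that core pattern with the LangChain connector (model name illustrative; the actual example also wires the history into the retrieval chain):

```python
from langchain_core.messages import AIMessage, HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="meta/llama3-8b-instruct")
history = []  # grows each turn so the model sees prior context

for question in ["What is RAG?", "How does it reduce hallucinations?"]:
    history.append(HumanMessage(content=question))
    reply = llm.invoke(history)  # resend the accumulated conversation
    history.append(AIMessage(content=reply.content))
    print(reply.content)
```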
The enterprise RAG examples run as microservices distributed across multiple VMs and GPUs. These examples show how to orchestrate RAG pipelines with Kubernetes and deploy them with Helm.
Enterprise RAG examples include a Kubernetes operator for LLM lifecycle management. It is compatible with the NVIDIA GPU Operator, which automates GPU discovery and lifecycle management in a Kubernetes cluster.
Enterprise RAG examples also support local and remote inference with TensorRT-LLM and NVIDIA API Catalog endpoints.
Model | Embedding | Framework | Description | Multi-GPU | Multi-node | TRT-LLM | NVIDIA Endpoints | Triton | Vector Database |
---|---|---|---|---|---|---|---|---|---|
llama-3 | nv-embed-qa-4 | LlamaIndex | Chat bot, Kubernetes deployment [chart] | No | No | Yes | No | Yes | Milvus |
The generative AI model examples include end-to-end steps for pre-training, customizing, aligning, and running inference on state-of-the-art generative AI models using the NVIDIA NeMo Framework.
Model | Resource(s) | Framework | Description |
---|---|---|---|
gemma | Docs, LoRA, SFT | NeMo | Aligning and customizing Gemma, and exporting to TensorRT-LLM format for inference |
codegemma | Docs, LoRA | NeMo | Customizing CodeGemma and exporting to TensorRT-LLM format for inference
starcoder-2 | LoRA, Inference | NeMo | Customizing StarCoder2 with the NeMo Framework, optimizing with NVIDIA TensorRT-LLM, and deploying with NVIDIA Triton Inference Server
small language models (SLMs) | Docs, Pre-training and SFT, Eval | NeMo | Training, alignment, and running evaluation on SLMs using various techniques |
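Several rows above end with exporting a customized checkpoint to TensorRT-LLM format. A hedged sketch of that step, assuming the NeMo Framework container's `nemo.export` module; the paths, model type, and GPU count are illustrative, and the exact API may differ across NeMo versions, so consult the linked docs:

```python
from nemo.export import TensorRTLLM  # ships inside the NeMo Framework container

# Build a TensorRT-LLM engine from a customized NeMo checkpoint
# (paths and model type below are illustrative placeholders).
exporter = TensorRTLLM(model_dir="/workspace/trtllm_engine")
exporter.export(
    nemo_checkpoint_path="/workspace/gemma-7b-sft.nemo",
    model_type="gemma",
    n_gpus=1,
)

# Smoke-test the built engine with a sample prompt.
print(exporter.forward(["Write one sentence about TensorRT-LLM."]))
```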
Example tools and tutorials to enhance LLM development and productivity when using NVIDIA RAG pipelines.
Name | Description | NVIDIA Endpoints |
---|---|---|
Evaluation | RAG evaluation using synthetic data generation and LLM-as-a-judge [code, docs] | Yes |
Observability | Monitoring and debugging RAG pipelines [code, docs] | Yes |
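The evaluation tool's LLM-as-a-judge approach can be sketched in a few lines. This is a minimal, hedged illustration, not the repo's actual harness; the judge model and grading rubric are illustrative:

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA

judge = ChatNVIDIA(model="meta/llama3-70b-instruct")

def judge_answer(question: str, reference: str, answer: str) -> str:
    """Ask a strong LLM to grade a RAG answer against a reference answer."""
    prompt = (
        "Rate ANSWER against REFERENCE for factual accuracy on a 1-5 scale.\n"
        f"QUESTION: {question}\nREFERENCE: {reference}\nANSWER: {answer}\n"
        "Reply with only the number."
    )
    return judge.invoke(prompt).content

print(judge_answer(
    "What is RAG?",
    "Retrieval-augmented generation grounds LLM output in retrieved documents.",
    "RAG retrieves documents and feeds them to the LLM as context.",
))
```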
These open source connectors for NVIDIA-hosted and self-hosted API endpoints are maintained and tested by NVIDIA engineers.
Name | Framework | Chat | Text Embedding | Python | Description |
---|---|---|---|---|---|
NVIDIA AI Foundation Endpoints | LangChain | Yes | Yes | Yes | Easy access to NVIDIA-hosted models. Supports chat, embedding, code generation, SteerLM, multimodal, and RAG.
NVIDIA Triton + TensorRT-LLM | LangChain | Yes | Yes | Yes | This connector allows LangChain to remotely interact with a Triton Inference Server over gRPC or HTTP for optimized LLM inference.
NVIDIA Triton Inference Server | LlamaIndex | Yes | Yes | No | Triton Inference Server provides API access to hosted LLM models over gRPC.
NVIDIA TensorRT-LLM | LlamaIndex | Yes | Yes | No | TensorRT-LLM provides a Python API to build TensorRT engines with state-of-the-art optimizations for LLM inference on NVIDIA GPUs. |
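As a quick taste of the first connector, a hedged sketch of text embedding through NVIDIA AI Foundation Endpoints; the model name is illustrative and an `NVIDIA_API_KEY` environment variable is assumed:

```python
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings(model="NV-Embed-QA")

# Document and query embeddings are requested separately, matching
# the LangChain Embeddings interface.
doc_vectors = embedder.embed_documents(["RAG grounds answers in your data."])
query_vector = embedder.embed_query("What grounds answers in data?")
print(len(doc_vectors[0]), len(query_vector))  # embedding dimensionality
```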
- NVIDIA Tokkio LLM-RAG: Use Tokkio to add avatar animation for RAG responses.
- RAG on Windows using TensorRT-LLM and LlamaIndex: Create RAG chatbots on Windows using TensorRT-LLM.
- Hybrid RAG Project on AI Workbench: Run an NVIDIA AI Workbench example project for RAG.
Refer to the releases page for information about previous releases.
We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback. We invite contributions via GitHub Issues or pull requests!
- Some known issues are identified as TODOs in the Python code.
- The datasets provided as part of this project are under a different license for research and evaluation purposes.
- This project downloads and installs third-party open source software projects. Review the license terms of these open source projects before use.