trt-llm-rag-windows

A developer reference project for creating Retrieval Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM


🚀 RAG on Windows using TensorRT-LLM and LlamaIndex 🦙

This repository showcases a Retrieval-Augmented Generation (RAG) pipeline implemented with the llama_index library on Windows. The pipeline combines the Llama 2 13B model, TensorRT-LLM, and the FAISS vector search library. For demonstration, the dataset consists of thirty recent articles sourced from NVIDIA GeForce News.

What is RAG? 🔍

Retrieval-augmented generation (RAG) for large language models (LLMs) seeks to enhance prediction accuracy by leveraging an external datastore during inference. This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge.
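
In code, a minimal RAG loop with llama_index looks roughly like the sketch below. It is an illustration of the pattern only: it relies on the library's default LLM and embedding backends rather than this project's TensorRT-LLM and FAISS setup, and the example question is just a placeholder.

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load the context documents (this repo keeps them in dataset/).
documents = SimpleDirectoryReader("dataset").load_data()
# Embed the documents and build a vector index over them.
index = VectorStoreIndex.from_documents(documents)
# At query time, relevant chunks are retrieved and passed to the LLM as context.
query_engine = index.as_query_engine()
print(query_engine.query("What is new in the latest GeForce driver?"))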

Getting Started

Important

This branch is valid only for TRT-LLM v0.5. For the instructions compatible with the latest TRT-LLM release (v0.7.1), please switch to this branch: trt-llm-0.7.1

Ensure you have the following prerequisites in place:

  1. Install TensorRT-LLM for Windows using the instructions here.

  2. Ensure you have access to the Llama 2 repository on Hugging Face.

  3. In this project, the Llama 2 13B AWQ 4-bit quantized model is used for inference. Before using it, you'll need to compile a TensorRT engine specific to your GPU. If you're using a GeForce RTX 4090 (TensorRT 9.1.0.4 and TensorRT-LLM release 0.5.0), the compiled TRT engine is available for download here. For other NVIDIA GPUs or TensorRT versions, please refer to the instructions.

Setup Steps

  1. Clone this repository:
git clone https://github.com/NVIDIA/trt-llm-rag-windows.git
  2. Place the TensorRT engine for the Llama 2 13B model in the model/ directory:
  • For GeForce RTX 4090 users: Download the pre-built TRT engine here and place it in the model/ directory.
  • For other NVIDIA GPU users: Build the TRT engine by following the instructions provided here.
  3. Acquire the Llama tokenizer files (tokenizer.json, tokenizer.model and tokenizer_config.json) here.
  4. Download the AWQ weights (model.pt) used to build the TensorRT engine here. (For RTX 4090, use the pre-generated engine provided earlier.)
  5. Install the necessary libraries:
pip install -r requirements.txt
  6. Launch the application with the following command:
python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokenizer folder> --data_dir <Data folder>

In our case, that will be:

python app.py --trt_engine_path model/ --trt_engine_name llama_float16_tp1_rank0.engine --tokenizer_dir_path model/ --data_dir dataset/

Note: On its first run, this example persists/caches the embedded data folder in the vector store. Any modifications to the data folder won't take effect until the "storage-default" cache directory is removed from the application directory.
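
The caching behavior in the note can be pictured with llama_index's persistence APIs. The directory name storage-default matches the note above, but the exact wiring inside app.py may differ, so treat this as a sketch of the idea:

import os
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "storage-default"  # cache directory referenced in the note above

if not os.path.exists(PERSIST_DIR):
    # First run: embed the data folder and cache the resulting index on disk.
    documents = SimpleDirectoryReader("dataset").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Subsequent runs: reuse the cached index; edits to dataset/ are ignored
    # until the storage-default directory is deleted.
    index = load_index_from_storage(StorageContext.from_defaults(persist_dir=PERSIST_DIR))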

Detailed Command References

python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokenizer folder> --data_dir <Data folder>

Arguments

Name                      Details
--trt_engine_path <>      Directory containing the TensorRT engine
--trt_engine_name <>      Engine file name (e.g. llama_float16_tp1_rank0.engine)
--tokenizer_dir_path <>   Directory with the Hugging Face tokenizer files, e.g. https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
--data_dir <>             Directory with context data (pdf, txt, etc.), e.g. ".\dataset"
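
For illustration only, the flags in the table map naturally onto a standard argparse definition. This is a hypothetical reconstruction, not the actual parsing code in app.py:

import argparse

parser = argparse.ArgumentParser(description="RAG chatbot on Windows with TensorRT-LLM")
parser.add_argument("--trt_engine_path", required=True,
                    help="Directory containing the TensorRT engine")
parser.add_argument("--trt_engine_name", required=True,
                    help="Engine file name, e.g. llama_float16_tp1_rank0.engine")
parser.add_argument("--tokenizer_dir_path", required=True,
                    help="Directory with the Hugging Face tokenizer files")
parser.add_argument("--data_dir", required=True,
                    help="Directory with context data (pdf, txt, etc.)")
args = parser.parse_args()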

Building TRT Engine

For RTX 4090 (TensorRT 9.1.0.4 & TensorRT-LLM 0.5.0), a prebuilt TRT engine is provided. For other RTX GPUs or TensorRT versions, follow these steps to build your TRT engine:

  1. Download the Llama 2 13B chat model from https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
  2. Download the Llama 2 13B AWQ int4 checkpoint model.pt here
  3. Clone the TensorRT-LLM repository:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
  4. Navigate to the examples\llama directory and run the following script:
python build.py --model_dir <path to llama13_chat model> --quant_ckpt_path <path to model.pt> --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir <TRT engine folder>

Adding your own data

  • This app loads data from the dataset/ directory into the vector store. To use your own data, replace the files in the dataset/ directory with your own. By default, the script uses LlamaIndex's SimpleDirectoryReader, which supports text files in several formats such as .txt, PDF, and so on (see the sketch below).
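
As a quick check that your files are picked up, the default loading path can be reproduced in a few lines, assuming llama_index's SimpleDirectoryReader; adjust the folder name if you keep your data somewhere other than dataset/:

from llama_index import SimpleDirectoryReader

# Point this at your own folder if you do not reuse dataset/.
documents = SimpleDirectoryReader("dataset", recursive=True).load_data()
print(f"Loaded {len(documents)} documents")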

This project requires additional third-party open source software projects as specified in the documentation. Review the license terms of these open source projects before use.