This repository provides scripts and instructions for preparing and deploying Large Language Models (LLMs) on the Triton Inference Server with TensorRT-LLM. The workflow is streamlined so you can deploy LLMs efficiently in production environments.
To get started, follow these steps:
- Change into the main directory:

  ```bash
  cd DEPLOY_T5
  ```
- Build a Docker image from the Dockerfile:

  ```bash
  docker build -t triton-llm-deployment .
  ```
- Create a container from the image and start an interactive session (a GPU-enabled variant is sketched after these steps):

  ```bash
  docker run -it --name triton-llm-container triton-llm-deployment
  ```
- Within the container, execute the bash scripts in order to prepare and deploy your LLM model:

  ```bash
  bash scripts/1.install_git_and_lfs.sh
  bash scripts/2.install_tensorrt_llm.sh
  bash scripts/3.trendyol_llm_tensorrt_engine_build_and_test.sh
  bash scripts/4.create_triton_model_repository.sh
  bash scripts/5.run_triton.sh
  bash scripts/6.call_triton_model_curl_example.sh
  ```
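Serving a TensorRT-LLM model from inside a container normally requires direct GPU access. A minimal sketch of a GPU-enabled `docker run`, assuming Triton's default HTTP/gRPC/metrics ports; the `--gpus` flag is standard Docker, but whether this particular image needs extra mounts or ports is an assumption:

```bash
# Sketch: same container as above, but with all host GPUs exposed and
# Triton's default ports published (8000 HTTP, 8001 gRPC, 8002 metrics).
# The port mappings are assumptions based on Triton's defaults.
docker run -it --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  --name triton-llm-container triton-llm-deployment
```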
Each script performs one step of the workflow:

- `1.install_git_and_lfs.sh`: Installs Git and Git LFS if they are not already installed.
- `2.install_tensorrt_llm.sh`: Fetches the tensorrtllm_backend project and its submodules, ensuring compatibility with Triton.
- `3.trendyol_llm_tensorrt_engine_build_and_test.sh`: Downloads the Trendyol/Trendyol-LLM-7b-chat-v1.0 model from Hugging Face, converts it into a TensorRT-LLM checkpoint, and then builds a TensorRT-LLM engine from that checkpoint for testing (see the conversion sketch after this list).
- `4.create_triton_model_repository.sh`: Creates the Triton model repository and copies the model definition files into it.
- `5.run_triton.sh`: Starts the Triton server to serve the deployed LLM model (see the launch sketch below).
- `6.call_triton_model_curl_example.sh`: Shows how to send requests to the deployed model using cURL (see the request sketch below).
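For orientation, checkpoint conversion and engine building in TensorRT-LLM generally follow the pattern below. This is a hedged sketch of what `3.trendyol_llm_tensorrt_engine_build_and_test.sh` likely does, not a copy of it: the local paths, the choice of the `examples/llama` converter (assuming a Llama-style architecture), and the flags are all assumptions.

```bash
# Hedged sketch (paths, converter choice, and flags are assumptions):
# 1) Convert the Hugging Face weights into a TensorRT-LLM checkpoint.
#    convert_checkpoint.py lives under the architecture-specific folder in
#    the TensorRT-LLM examples; the llama example is assumed here.
python3 tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir ./Trendyol-LLM-7b-chat-v1.0 \
  --output_dir ./trt_ckpt \
  --dtype float16

# 2) Build a TensorRT engine from the converted checkpoint.
trtllm-build \
  --checkpoint_dir ./trt_ckpt \
  --output_dir ./trt_engine \
  --gemm_plugin float16
```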
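Similarly, `5.run_triton.sh` presumably starts the server with the launcher that ships with tensorrtllm_backend; the repository path and world size below are assumptions about this deployment:

```bash
# Hedged sketch: launch_triton_server.py is the standard helper script in
# tensorrtllm_backend; the model repository path is an assumption.
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
  --world_size 1 \
  --model_repo ./triton_model_repo
```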
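Finally, a request like the one in `6.call_triton_model_curl_example.sh` typically targets Triton's HTTP generate endpoint. The model name `ensemble` and the input field names follow tensorrtllm_backend conventions but are assumptions about this particular model repository:

```bash
# Hedged sketch: Triton's generate endpoint, assuming the default HTTP
# port (8000) and the tensorrtllm_backend "ensemble" model name.
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```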