Triton Server on Lightning AI

Introduction

The Triton Server component lets you deploy your model to Triton Inference Server and sets up a FastAPI interface that converts API datatypes (string, integer, float, etc.) to and from Triton datatypes (DT_STRING, DT_INT32, etc.).
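To make the conversion concrete, here is a minimal sketch of such a mapping. It uses the datatype names from Triton's HTTP/gRPC inference protocol (`BOOL`, `INT32`, `FP32`, `BYTES`). The `to_triton_dtype` helper and the chosen bit widths are assumptions for illustration only; the component's internal mapping may differ.

```python
# Illustrative mapping from Python API types to Triton protocol datatype names.
# The helper name and the chosen widths are assumptions, not the component's API.
_PY_TO_TRITON = {
    bool: "BOOL",
    int: "INT32",
    float: "FP32",
    str: "BYTES",  # Triton transports strings as BYTES tensors
}


def to_triton_dtype(value):
    """Return the Triton datatype name for a supported Python value."""
    # Exact type lookup so that True is reported as BOOL rather than INT32
    # (bool is a subclass of int in Python).
    try:
        return _PY_TO_TRITON[type(value)]
    except KeyError:
        raise TypeError(f"unsupported API type: {type(value).__name__}") from None


for sample in ("cat", 3, 0.5, True):
    print(type(sample).__name__, "->", to_triton_dtype(sample))
```

Anything outside these four types raises a `TypeError`, which mirrors the four-datatype limit listed under Known Limitations.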

What is Triton

Triton Inference Server is an open-source deep learning inference server designed by Nvidia to make AI model deployment easy and efficient. It supports multiple model formats and hardware platforms, and helps use compute efficiently by batching requests and optimizing model execution. For more details, refer to the developer blog from Nvidia.

Let's do an example

We'll use the Triton Server component in this example to serve a torchvision model.

Save the following code as torch_vision_server.py

# !pip install torch torchvision pillow
# !pip install lightning_triton@git+https://github.com/Lightning-AI/LAI-Triton-Server-Component.git
import lightning as L
import base64, io, torch, torchvision, lightning_triton as lt
from PIL import Image


class TorchvisionServer(lt.TritonServer):
    def __init__(self, input_type=lt.Image, output_type=lt.Category, **kwargs):
        super().__init__(
            input_type=input_type, output_type=output_type, max_batch_size=8, **kwargs
        )
        self._device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self._model = None

    def setup(self):
        self._model = torchvision.models.resnet18(
            weights=torchvision.models.ResNet18_Weights.DEFAULT
        )
        self._model.to(self._device)
        # Switch to inference mode so BatchNorm uses its running statistics.
        self._model.eval()

    def predict(self, request):
        image = base64.b64decode(request.image.encode("utf-8"))
        # Force three channels so grayscale or RGBA inputs match Normalize below.
        image = Image.open(io.BytesIO(image)).convert("RGB")
        transforms = torchvision.transforms.Compose(
            [
                torchvision.transforms.Resize(224),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
                ),
            ]
        )
        image = transforms(image)
        image = image.to(self._device)
        # Inference only: skip gradient tracking to save memory and time.
        with torch.no_grad():
            prediction = self._model(image.unsqueeze(0))
        return {"category": prediction.argmax().item()}


cloud_compute = L.CloudCompute("gpu", shm_size=512)
app = L.LightningApp(TorchvisionServer(cloud_compute=cloud_compute))

Install lightning

If you don't have lightning installed yet, install it using:

pip install -U lightning

Run it locally

Because installing Triton directly can be tricky (and is not officially supported) on some operating systems, the component runs the Triton server inside Docker. It expects Docker to be already installed on your system. If you don't have Docker installed, install it from the official Docker website.

Note that you don't need to install Docker if you only run the component in the cloud. Keep in mind that the Docker image is very large (about 20 GB), so startup can be slow the first time you run it.
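Before a local run, it can help to confirm that Docker is installed and its daemon is reachable. The following is a minimal sketch using only the standard library; the `docker_available` helper is hypothetical and not part of the component.

```python
import shutil
import subprocess


def docker_available():
    """Return True if a `docker` executable is on PATH and its daemon responds."""
    if shutil.which("docker") is None:
        return False
    try:
        # `docker info` exits non-zero when the daemon is not reachable.
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker ready:", docker_available())
```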

Run it locally using:

lightning run app torch_vision_server.py --setup

Run it in the cloud

Run it in the cloud using:

lightning run app torch_vision_server.py --setup --cloud
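Once the app is running, a client needs to send the image as a base64 string, because `predict` decodes `request.image` with `base64.b64decode`. The sketch below shows one way to build and send such a request with the standard library. The address in `SERVER_URL`, including the `/predict` path, is an assumption; use the URL the app reports when it starts. The helper names are hypothetical.

```python
import base64
import json
import urllib.request

# Hypothetical address: replace with the URL printed when the app starts.
SERVER_URL = "http://127.0.0.1:7777/predict"


def build_payload(image_bytes):
    """Base64-encode raw image bytes into the JSON body that predict() decodes."""
    return {"image": base64.b64encode(image_bytes).decode("utf-8")}


def classify(image_path, url=SERVER_URL):
    """POST an image file to the server and return the parsed JSON response."""
    with open(image_path, "rb") as f:
        body = json.dumps(build_payload(f.read())).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `classify("cat.jpg")` against a running server should return a dictionary with a `category` field holding the predicted class index, assuming the endpoint matches.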

More examples

Check out more examples that serve different model types in the example directory, and follow the instructions provided with each of them.

Benchmark

Triton Server is in the very early stages of development and is not yet optimized for performance. We'll track progress with the benchmarks in this section, which compare Triton Server with PythonServer. The results below were measured on two different GPU instances using the Stable Diffusion component. For more details, refer to the benchmarking section of the Stable Diffusion component.

Device	Server Type	Req/Sec	Latency	Batch Size
gpu-rtx (g5.2xlarge)	PythonServer	~0.2	7s	1
gpu-rtx (g5.2xlarge)	TritonServer	~0.1	7.3s	1
gpu-fast (p3.2xlarge)	PythonServer	~0.2	6s	1
gpu-fast (p3.2xlarge)	TritonServer	~0.1	7.5s	1
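Numbers of this kind (requests per second and latency at batch size 1) can be collected with a simple sequential timing loop. The sketch below is an illustration only, not the script used to produce the table; `benchmark` and its `send_request` argument are hypothetical names, and the `time.sleep` call stands in for a real HTTP request.

```python
import time


def benchmark(send_request, num_requests=10):
    """Call send_request() sequentially and report throughput and mean latency."""
    latencies = []
    start = time.perf_counter()
    for _ in range(num_requests):
        t0 = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "req_per_sec": num_requests / elapsed,
        "mean_latency_s": sum(latencies) / len(latencies),
    }


if __name__ == "__main__":
    # Stand-in for a real HTTP call to the server.
    print(benchmark(lambda: time.sleep(0.01), num_requests=5))
```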

Next Steps

At present, our focus is on improving the performance of Triton Server, which includes tackling the following issues:

Dynamic batching with python backend
Supporting TensorRT backend
Dynamic batching with TensorRT backend
Concurrent model execution
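For readers unfamiliar with dynamic batching, the idea is to group requests that arrive close together so the model processes them in one forward pass. The following toy sketch illustrates the grouping rule only; it is not how Triton implements it, and `group_into_batches`, `max_batch_size`, and `max_delay` are names chosen for this illustration.

```python
def group_into_batches(arrivals, max_batch_size, max_delay):
    """Group (arrival_time, request) pairs the way a dynamic batcher might.

    A batch is closed when it holds max_batch_size requests, or when the next
    request arrives more than max_delay seconds after the batch's first one.
    """
    batches, current, first_time = [], [], None
    for arrival_time, request in sorted(arrivals, key=lambda pair: pair[0]):
        if current and (
            len(current) == max_batch_size or arrival_time - first_time > max_delay
        ):
            batches.append(current)
            current = []
        if not current:
            first_time = arrival_time
        current.append(request)
    if current:
        batches.append(current)
    return batches


arrivals = [(0.00, "a"), (0.01, "b"), (0.02, "c"), (0.50, "d")]
print(group_into_batches(arrivals, max_batch_size=2, max_delay=0.1))
# prints [['a', 'b'], ['c'], ['d']]
```

With a larger `max_batch_size`, the first three requests would share one batch, while the late request `"d"` would still be served on its own.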

Known Limitations

This component is still in the early stages of development. If you run into any of the limitations below, or find other issues, please create a GitHub issue so we can prioritise them. The known limitations that are being worked on are:

When running locally, ctrl-c must be pressed twice to stop all the processes.
Running locally requires Docker to be installed.
Only the Python backend is supported for the Triton server. This means that many optimizations specific to other backends, such as TensorRT, cannot be used with this component yet.
Not all features of Triton are configurable through the component yet.
Only four datatypes are supported at the API level (string, integer, float, bool).
Providing a pre-created Model Repository to the component is not supported yet. This means that if you have an existing model repository, you cannot use it with this component yet.

Lightning-Universe/Triton-Server_component