/rapids-triton

Primary LanguageC++Apache License 2.0Apache-2.0

License

The RAPIDS-Triton Library

This project is designed to make it easy to integrate any C++-based algorithm into the NVIDIA Triton Inference Server. Originally developed to assist with the integration of RAPIDS algorithms, this library can be used by anyone to quickly get up and running with a custom backend for Triton.

Background

Triton

The NVIDIA Triton Inference Server offers a complete open-source solution for deployment of machine learning models from a wide variety of ML frameworks (PyTorch, Tensorflow, ONNX, XGBoost, etc.) on both CPU and GPU hardware. It allows you to maximize inference performance in production (whether that means maximizing throughput, minimizing latency, or optimizing some other metric) regardless of how you may have trained your ML model. Through smart batching, efficient pipeline handling, and tools to simplify deployments almost anywhere, Triton helps make production inference serving simpler and more cost-effective.

Custom Backends

While Triton natively supports many common ML frameworks, you may wish to take advantage of Triton's features for something a little more specialized. Triton provides support for different kinds of models via "backends:" modular libraries which provide the specialized logic for those models. Triton allows you to create custom backends in Python, but for those who wish to use C++ directly, RAPIDS-Triton can help simplify the process of developing your backend.

The goal of RAPIDS-Triton is not to facilitate every possible use case of the Triton backend API but to make the most common uses of this API easier by providing a simpler interface to them. That being said, if there is a feature of the Triton backend API which RAPIDS-Triton does not expose and which you wish to use in a custom backend, please submit a feature request, and we will see if it can be added.

Simple Example

In the cpp/src directory of this repository, you can see a complete, annotated example of a backend built with RAPIDS-Triton. The core of any backend is defining the predict function for your model as shown below:

  void predict(rapids::Batch& batch) const {
    rapids::Tensor<float> input = get_input<float>(batch, "input__0");
    rapids::Tensor<float> output = get_output<float>(batch, "output__0");

    rapids::copy(output, input);

    output.finalize();
  }

In this example, we ask Triton to provide a tensor named "input__0" and copy it to an output tensor named "output__0". Thus, our "inference" function in this simple example is just a passthrough from one input tensor to one output tensor.

To do something more sophisticated in this predict function, we might take advantage of the data() method of Tensor objects, which provides a raw pointer (on host or device) to the underlying data along with size(), and mem_type() to determine the number of elements in the Tensor and whether they are stored on host or device respectively. Note that finalize() must be called on all output tensors before returning from the predict function.

For a much more detailed look at developing backends with RAPIDS-Triton, check out our complete usage guide.

Contributing

If you wish to contribute to RAPIDS-Triton, please see our contributors' guide for tips and full details on how to get started.