triton-inference-server/server

Ragged batching support for ML backends

Closed · 3 comments

Is your feature request related to a problem? Please describe.

In scientific data analysis workflows, a single "event" (corresponding to a single request to the Triton server) may contain multiple objects, such as clusters, each of which has a different number of components and therefore a different number of features. The input data for these variable-length objects can be sent in a single request using "ragged batching", but this is currently supported only for the TensorRT backend (and any custom backends that happen to implement it).
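
As a concrete illustration (the cluster sizes and feature counts below are invented, not taken from any real workflow), the per-event data might look like this, with the variable-length clusters concatenated into one ragged tensor plus element counts rather than forced into a fixed-shape array:

```python
import numpy as np

# Hypothetical event: three clusters with 5, 12, and 3 components,
# each component described by 4 features (all sizes are illustrative).
clusters = [np.random.rand(n, 4).astype(np.float32) for n in (5, 12, 3)]

# With ragged batching, the variable-length clusters can be shipped as one
# concatenated tensor plus per-cluster element counts, instead of being
# padded out to a fixed shape.
ragged_input = np.concatenate(clusters, axis=0)                        # shape (20, 4)
element_counts = np.array([len(c) for c in clusters], dtype=np.int32)  # [5, 12, 3]
```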

Describe the solution you'd like
It would be very useful to support this feature in the common ML backends: ONNX Runtime, PyTorch, and TensorFlow.

Describe alternatives you've considered
Feature vectors can be padded to a fixed maximum length on the client side, but this is tedious (and potentially less efficient, since the request consumes extra bandwidth transmitting padding that carries no information). A minimal sketch of this workaround follows below.
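
For comparison, here is the padding workaround sketched out; the maximum length and array sizes are arbitrary example values:

```python
import numpy as np

MAX_COMPONENTS = 64  # arbitrary example; must be at least the largest cluster size

def pad_cluster(features: np.ndarray) -> np.ndarray:
    """Zero-pad a (n_components, n_features) array to MAX_COMPONENTS rows."""
    n_pad = MAX_COMPONENTS - features.shape[0]
    return np.pad(features, ((0, n_pad), (0, 0)), mode="constant")

# Hypothetical clusters with 5, 12, and 3 components of 4 features each.
clusters = [np.random.rand(n, 4).astype(np.float32) for n in (5, 12, 3)]
padded = np.stack([pad_cluster(c) for c in clusters])  # shape (3, 64, 4), mostly zeros
```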

How is this different from the existing ragged batching support?

dzier commented

Triton does support ragged batching. You can specify it in the model config proto file. For example: https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto#L390
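
For reference, here is a sketch of how this is typically expressed in a model's config.pbtxt; the input name, sizes, and the particular batch_input kind are illustrative choices, not prescribed by this issue:

```
max_batch_size: 16
dynamic_batching { }
input [
  {
    name: "RAGGED_INPUT"        # illustrative name
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true    # let requests with different lengths be batched together
  }
]
batch_input [
  {
    # Provides accumulated element counts so the model can recover per-request boundaries.
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "RAGGED_INPUT"
  }
]
```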

To clarify: ragged batching support, which was initially available only for TensorRT, has been added to the server for the TensorFlow and ONNX Runtime backends over the past few releases. (PyTorch will have to wait for the conclusion of the nestedtensor project.)