triton-inference-server/server

Support input/output compression


Is your feature request related to a problem? Please describe.
Some algorithms may require large amounts of input data, which could saturate network bandwidth for clients. (Generative or other algorithms may correspondingly return large amounts of output data.) It would be very useful in these cases to be able to compress the input and/or output, trading CPU time (for compression/decompression) against bandwidth, depending on which resource is more constrained.

Describe the solution you'd like
Consistent client/server options to enable compression for input and/or output, with different compression schemes and levels supported.

Describe alternatives you've considered
This could potentially be done manually, using custom pre- and post-processing backends in an ensemble model that understand how to (de)compress the data, with the reverse process implemented by the user in their client-side interface. However, this would mean a lot of duplicated effort and likely some inefficiency.

Additional context
N/A

gRPC has some built-in compression options. Would you just like those exposed? HTTP would need a custom solution on both the client and server side. Do you have any specific solutions, compression schemes, etc. in mind?

Yes, exposing the gRPC compression options would be a great starting point. (We're not using the HTTP client right now, so that's lower priority.)

Exposed the gRPC compression options via PR #2628. The compression option for HTTP is still a TODO.
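For anyone landing here, a minimal sketch of what enabling gRPC compression looks like from the Python client. The `compression_algorithm` parameter follows the `tritonclient.grpc` API; the model name and input/output tensor names below are hypothetical, so adapt them to your model, and double-check the exact signature against the client docs for your release.

```python
# Sketch: enabling gRPC compression from the Python client.
# Assumes a tritonclient release that includes PR #2628.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Hypothetical model with a single FP32 input named "INPUT0".
data = np.random.rand(1, 1024).astype(np.float32)
inputs = [grpcclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)

# "gzip" or "deflate"; None (the default) disables compression.
result = client.infer(
    model_name="my_model",
    inputs=inputs,
    compression_algorithm="gzip",
)
print(result.as_numpy("OUTPUT0"))
```

Whether compression actually helps depends on the data: large, repetitive tensors compress well, while already-dense random floats may not be worth the CPU cost.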

Hi, just saw that HTTP compression was merged recently. Is it available in the 2.9 release?
The release notes seem to mention only "The GRPC client libraries now allow compression to be enabled".
Thanks

It will be available for the 21.05 release.
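For completeness, a similar sketch for the HTTP client once 21.05 is out. The `request_compression_algorithm` and `response_compression_algorithm` parameter names reflect the `tritonclient.http` API, but treat them as assumptions and verify against the 21.05 client documentation; the model and tensor names are again hypothetical.

```python
# Sketch: HTTP request/response compression from the Python client (21.05+).
# Parameter names are assumptions; check your release's client API.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 1024).astype(np.float32)
inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)

# Compress the request body and ask the server to compress the response.
result = client.infer(
    model_name="my_model",
    inputs=inputs,
    request_compression_algorithm="gzip",
    response_compression_algorithm="gzip",
)
print(result.as_numpy("OUTPUT0"))
```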