Back pressure for thread pool task queue
mrjackbo opened this issue · 3 comments
Hi,
Starting from the basic inference example, how would you advise implementing a simple back pressure mechanism? If I understand correctly, the task queue of the thread pool implementation that TensorRT Laboratory uses does not have an upper size limit. Thus, if, for example, my data ingest is much faster than inference, the program will eventually run out of memory, as the input tensors have to be captured by the lambdas in the task queue. It would be nice to have an optional behavior where enqueue becomes blocking once the task queue reaches a certain size.
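Something along the lines of the sketch below (just an illustration of the behavior I have in mind, not existing trtlab code) is what I mean: a task queue whose enqueue blocks once a configurable depth limit is reached.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

// Sketch of a bounded task queue: Push() blocks the producer (the data
// ingest thread) once the queue holds max_depth tasks, giving natural
// back pressure. Names here are illustrative, not part of the trtlab API.
class BoundedTaskQueue
{
  public:
    explicit BoundedTaskQueue(std::size_t max_depth) : m_MaxDepth(max_depth) {}

    void Push(std::function<void()> task)
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        // Block until a consumer has drained the queue below the limit.
        m_NotFull.wait(lock, [this] { return m_Tasks.size() < m_MaxDepth; });
        m_Tasks.push_back(std::move(task));
        m_NotEmpty.notify_one();
    }

    std::function<void()> Pop()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_NotEmpty.wait(lock, [this] { return !m_Tasks.empty(); });
        auto task = std::move(m_Tasks.front());
        m_Tasks.pop_front();
        m_NotFull.notify_one();
        return task;
    }

  private:
    std::size_t m_MaxDepth;
    std::deque<std::function<void()>> m_Tasks;
    std::mutex m_Mutex;
    std::condition_variable m_NotFull;
    std::condition_variable m_NotEmpty;
};
```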
Great question.
The nvRPC examples, similar to TRTIS, have several limits on the depth of the work queues:
- the number of gRPC contexts registered with the executor only allows for XXX gRPC messages to be pulled off the wire:
  `executor->RegisterContexts(rpcCompute, rpcResources, XXX);`
- if you are using the C++ or the Python TensorRT runtime, `InferRunner` has limits on the queue depth enforced by the `InferenceManager`. We provide a canonical example of what's happening in the `InferRunner`, which demonstrates where a call to the runtime might block (see the sketch after this list):
  `auto buffers = GetResources()->GetBuffers();`
- if you use the offload variant of the `Infer` method, then yes, you could get into a situation where your queue depth becomes unbounded and you run out of resources.
- if you use the direct variant, which operates on a `Bindings` object, then you've had to acquire a `Bindings` object, which, similar to the canonical example, means you've been limited by the `InferenceManager` and the call may have blocked.
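To make the canonical example concrete, here is a rough sketch of the pattern it relies on (illustrative only, not the actual trtlab implementation): a fixed-size pool whose `Acquire()` blocks while every buffer is checked out, so upstream producers stall instead of growing an unbounded queue.

```cpp
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <stack>
#include <vector>

// Fixed-size pool of host buffers. Acquire() blocks when the pool is empty;
// the returned handle gives the buffer back when it goes out of scope.
// The pool must outlive every buffer it hands out.
class BufferPool
{
  public:
    BufferPool(std::size_t count, std::size_t bytes_per_buffer)
    {
        for (std::size_t i = 0; i < count; i++)
        {
            m_Free.push(std::make_unique<std::vector<char>>(bytes_per_buffer));
        }
    }

    std::shared_ptr<std::vector<char>> Acquire()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        // This wait is the back pressure point: no free buffer, no progress.
        m_Available.wait(lock, [this] { return !m_Free.empty(); });
        std::unique_ptr<std::vector<char>> buffer = std::move(m_Free.top());
        m_Free.pop();
        return std::shared_ptr<std::vector<char>>(
            buffer.release(), [this](std::vector<char>* ptr) { Release(ptr); });
    }

  private:
    void Release(std::vector<char>* ptr)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Free.push(std::unique_ptr<std::vector<char>>(ptr));
        m_Available.notify_one();
    }

    std::mutex m_Mutex;
    std::condition_variable m_Available;
    std::stack<std::unique_ptr<std::vector<char>>> m_Free;
};
```

Returning the buffer through the `shared_ptr` deleter means the code that fills and consumes the buffer never has to release it explicitly; the limit is enforced entirely at acquisition time.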
Does this help?
On the second part of your question, recognizing and reacting to queue depth:
Both TRTIS and some of the TRTLAB examples expose Prometheus metrics. The load ratio = request_time / compute_time is a nice way to gauge queue depth. If you just measure queue depth, that doesn't tell you what kind of model is sitting in the queue: a large queue of models that don't take long to compute means something different than a queue of a model that takes a long time to compute. The load ratio normalizes that with respect to the time the model takes to compute. You can use the load ratio and GPU energy consumption metrics to trigger horizontal auto-scaling. Follow Issue #20 to track progress on this.
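As a concrete illustration (not code from TRTIS or the LAB), the metric is just the ratio of two measured durations:

```cpp
#include <chrono>

// request_time: total time from when the request arrived to when the
//               response was sent (queue wait + compute).
// compute_time: time actually spent executing the model.
double LoadRatio(std::chrono::duration<double> request_time,
                 std::chrono::duration<double> compute_time)
{
    return request_time.count() / compute_time.count();
}

// e.g. a request that took 40 ms end-to-end on a model whose compute time
// was 10 ms has a load ratio of 4.0: it spent ~3x its compute time queued.
// A ratio near 1.0 means the queue is keeping up; a sustained ratio well
// above 1.0 is a signal to scale out, regardless of how fast the model is.
```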
One thing we don't do in either the LAB or TRTIS is check the gRPC deadline. We should. One reactive way to deal with queue depth is to simply start canceling expired requests on the server side. If you control the client, this is also a good feedback signal that you need to grow the number of workers. Follow Issue #21 to track progress.
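For reference, a minimal sketch of what that server-side check could look like using the standard `grpc::ServerContext` API; the handler and message types below are placeholders, not nvRPC code:

```cpp
#include <chrono>
#include <grpcpp/grpcpp.h>

// Placeholder message types standing in for the generated protobuf classes.
struct InferRequest {};
struct InferResponse {};

// Refuse to start expensive compute when the client's deadline has already
// expired or the call was cancelled while it sat in the work queue.
grpc::Status Compute(grpc::ServerContext* ctx, const InferRequest* request,
                     InferResponse* response)
{
    if (ctx->IsCancelled() ||
        ctx->deadline() < std::chrono::system_clock::now())
    {
        return grpc::Status(grpc::StatusCode::DEADLINE_EXCEEDED,
                            "deadline expired while the request was queued");
    }
    // ... run inference and populate response ...
    return grpc::Status::OK;
}
```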
Thanks! Yes, that helps a lot. I will use a fixed-size resource pool of buffers into which my IO threads write the input blobs.
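Roughly along these lines, reusing the hypothetical `BufferPool` and `BoundedTaskQueue` sketched above (`BlobSource` and `RunInference` are stand-ins for my own IO and inference code):

```cpp
// Ingest loop sketch: both Acquire() and Push() block when their limits are
// hit, so ingest automatically slows down to the rate inference can sustain.
void IngestLoop(BufferPool& pool, BoundedTaskQueue& tasks, BlobSource& source)
{
    while (source.HasNext())
    {
        auto buffer = pool.Acquire();  // back pressure point #1: pool exhausted
        source.ReadInto(buffer->data(), buffer->size());

        tasks.Push([buffer] {          // back pressure point #2: queue full
            RunInference(*buffer);
            // the buffer returns to the pool when the last shared_ptr copy
            // (this captured one) goes out of scope
        });
    }
}
```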
Let me know if you run into any problems or if we can improve the experience in any way.