aws/sagemaker-huggingface-inference-toolkit

How to dynamically batch to help handle high load?

jambran opened this issue · 3 comments

Hi there,

I'm trying to deploy an endpoint that has bursts of high load. I'd like the endpoint to batch requests so we can increase throughput under high load, at the cost of a slight increase in latency under low load.

I found a blog post about how this can be done with TorchServe and AWS. See the section "TorchServe dynamic batching on SageMaker."

I'd like to have dynamic batching in a Hugging Face container, as I'm told it includes optimizations for transformer models.

I can see the batch_size parameter in the handler_service.py code, but I'm not sure of the recommended way to adjust it, or how to set a max_batch_delay.

Is this something currently available?

I reached out to AWS support, who suggested I open an issue here for assistance. If this is more appropriate for a Q&A forum, please let me know and point me there.

Thanks so much in advance,
Jamie

Hey @jambran,

could you maybe describe your use case in a bit more detail? When batching NLP/Transformers models, you need to ensure that all inputs have the same size (same length). Batching is not supported for all pipelines in transformers, and it is not more efficient than sequential processing if your inputs differ in length quite a bit.

The toolkit is built on top of AWS MMS (Multi Model Server), which supports dynamic batching; more here: https://github.com/awslabs/multi-model-server/blob/master/docs/batch_inference_with_mms.md
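
For reference, the linked MMS doc enables batching at model registration time through the management API. A minimal sketch is below; the model archive name, port, and parameter values are placeholders, and note that on SageMaker the MMS management API is not exposed to you directly, which is why this isn't a drop-in solution there:

```python
# Minimal sketch of registering a model with MMS dynamic batching enabled,
# following the linked batch_inference_with_mms.md doc. The model archive
# and host are placeholders.
import requests

resp = requests.post(
    "http://localhost:8081/models",   # MMS management API
    params={
        "url": "my-model.mar",        # placeholder model archive
        "batch_size": 8,              # max requests aggregated into one batch
        "max_batch_delay": 50,        # ms to wait for a full batch
        "initial_workers": 1,
    },
)
print(resp.status_code, resp.text)
```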

But my understanding is that SageMaker currently does not support it.

Thanks for the reply, @philschmid!

Our users will often paste large amounts of text into our system, which is then processed (currently) one sentence at a time. This takes quite a bit of time from start to finish. We've done some testing with user data, and we've found that small batches (4 or 8) allow for faster processing time overall. More often than not, sentences are roughly the same length, though I do hear your point about sentences with large differences in sequence length.
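
As an illustration of this kind of small-batch processing, here is a minimal sketch using the transformers pipeline's batch_size argument; the model name, sentences, and batch size below are placeholders, not what we actually run:

```python
# Minimal sketch: run sentences through a transformers pipeline in small
# batches instead of one call per sentence. Model name and batch_size are
# placeholders; pick values that match your own model and latency budget.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sentences = [
    "First sentence from the pasted text.",
    "Second sentence, roughly the same length.",
    "Third sentence, and so on.",
]

# The pipeline pads the sentences and runs them through the model in
# mini-batches of 8, rather than issuing one forward pass per sentence.
results = classifier(sentences, batch_size=8)
print(results)
```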

Thanks for sharing the link regarding dynamic batching with MMS. I will take a look.

@philschmid dynamic batching is necessary for most applications nowadays. TEI and TGI do dynamic batching, as does Infinity. Now, in 2024, do you think we can customize transform_fn in a pain-free way to do the same? A tutorial would be highly appreciated.
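
For anyone landing here, a rough sketch of what a custom transform_fn in an inference.py could look like for batching within a single request; the payload shape and batch size are assumptions, and this is in-request batching only, not server-side dynamic batching across requests:

```python
# inference.py — rough sketch of a custom transform_fn that accepts a list
# of texts in one request and runs them as a single batch. Payload shape
# and batch_size are assumptions; this does not add dynamic batching
# across separate requests.
import json
from transformers import pipeline

def model_fn(model_dir):
    # Load the model/tokenizer that SageMaker unpacked into model_dir.
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)

def transform_fn(model, input_data, content_type, accept):
    payload = json.loads(input_data)
    inputs = payload["inputs"]
    # Accept either a single string or a list of strings.
    texts = inputs if isinstance(inputs, list) else [inputs]
    # batch_size is illustrative; tune it for your model and instance type.
    predictions = model(texts, batch_size=8)
    return json.dumps(predictions)
```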