runpod-workers/worker-vllm

Slow streaming


Streaming is extremely slow. The intended effect is to have it look like it's typing, of course, but instead it just loads in laggy chunks. A GPU pod works fine; it's only the serverless endpoint that causes this. Unfortunately, until this is better we're forced to use HuggingFace serverless.

Hi, can you expand on this? Were you streaming with OpenAI compatibility or RunPod's streaming feature, and were you running the worker code on a pod or vLLM directly?

For the vLLM worker, I implemented dynamic batching for streaming tokens, which maximizes concurrent throughput while providing a time-to-first-token similar to having no batching at all. It sounds like you want each streamed batch of tokens to be smaller, so the rate is closer to typing; you can simply set the associated env vars/request params to lower values:

**Streaming Batch Size Settings:**

| Name | Default | Type/Choices | Description |
|------|---------|--------------|-------------|
| `DEFAULT_BATCH_SIZE` | 50 | `int` | Default and maximum batch size for token streaming, used to reduce the number of HTTP calls. |
| `DEFAULT_MIN_BATCH_SIZE` | 1 | `int` | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | 3 | `float` | Growth factor for the dynamic batch size. |
The way this works is that the first request will have a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request will have a batch size of `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`. This continues until the batch size reaches `DEFAULT_BATCH_SIZE`. E.g. for the default values, the batch sizes will be 1, 3, 9, 27, 50, 50, 50, .... You can also specify this per request with the inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`, as in the sketch below. This has nothing to do with vLLM's internal batching; it only controls the number of tokens sent in each HTTP request from the worker.
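As a rough sketch of the per-request route: this assumes the standard RunPod serverless `/run` API with a placeholder endpoint ID and API key, and that the batch-size inputs sit at the top level of `input` alongside the prompt (adjust to however you build your requests):

```python
import os
import requests

# Placeholder endpoint ID; the API key is read from the environment.
ENDPOINT_ID = "your_endpoint_id"
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {
        "prompt": "Write a short story about a robot.",
        "stream": True,
        # Smaller batches -> more, smaller chunks (closer to a typing effect),
        # at the cost of more HTTP calls and lower overall throughput.
        "min_batch_size": 1,
        "batch_size_growth_factor": 2,
        "max_batch_size": 8,
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(resp.json())  # returns the job id; poll the endpoint's /stream route for the token batches
```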

Note that this will lower your performance if you plan to serve more than one request at a time. Though the "laggy chunks" might seem slower, the throughput in tokens per second is actually much higher. Performance-wise, the ideal solution is to control, on your side, the speed at which you show the received tokens to the user, which lets you maximize throughput while achieving a typing effect at whatever speed you want.
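A minimal sketch of that client-side pacing (names and the rate here are made up; `chunks` stands for whatever text pieces your client collects from the stream):

```python
import time

def typewriter(chunks, chars_per_second=40):
    """Display already-received text at a steady rate.

    `chunks` is any iterable of text pieces (e.g. the token batches you
    collect from the streaming endpoint). Network speed and display speed
    are decoupled, so large batches still look like typing to the user.
    """
    delay = 1.0 / chars_per_second
    for chunk in chunks:
        for ch in chunk:
            print(ch, end="", flush=True)
            time.sleep(delay)
    print()

# Dummy batches standing in for streamed token chunks:
typewriter(["Hello", ", this arrived in ", "a few large batches."])
```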