runpod-workers/worker-vllm

Slow streaming


Streaming is extremely slow. The intended effect is to have it look like it's typing, of course, but instead it just loads in laggy chunks. A GPU pod works fine; it's only the serverless endpoint that causes this. Unfortunately, until this is better we're forced to use HuggingFace serverless.

Hi, can you expand on this? Were you streaming with OpenAI compatibility or RunPod's streaming feature, and were you running the worker code on a pod or vLLM directly?

For the vLLM worker, I implemented dynamic batching for streaming tokens, which maximizes concurrent throughput while providing a time-to-first-token similar to having no batching at all. It sounds like you want each streamed batch of tokens to be smaller, so the rate is closer to typing; you can simply set the associated env vars/request params to lower values:

**Streaming Batch Size Settings:**

| Name | Default | Type/Choices | Description |
|------|---------|--------------|-------------|
| `DEFAULT_BATCH_SIZE` | 50 | `int` | Default and maximum batch size for token streaming, used to reduce the number of HTTP calls. |
| `DEFAULT_MIN_BATCH_SIZE` | 1 | `int` | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
| `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | 3 | `float` | Growth factor for the dynamic batch size. |
The way this works is that the first request will have a batch size of `DEFAULT_MIN_BATCH_SIZE`, and each subsequent request will have a batch size of `previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR`. This continues until the batch size reaches `DEFAULT_BATCH_SIZE`. E.g. for the default values, the batch sizes will be 1, 3, 9, 27, 50, 50, 50, .... You can also specify this per request with the inputs `max_batch_size`, `min_batch_size`, and `batch_size_growth_factor`, as in the sketch below. This has nothing to do with vLLM's internal batching; it only controls the number of tokens sent in each HTTP request from the worker.
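As a rough sketch of the per-request route: this assumes the standard RunPod serverless `/run` API with a placeholder endpoint ID and API key, and that the batch-size inputs sit at the top level of `input` alongside the prompt (adjust to however you build your requests):

```python
import os
import requests

# Placeholder endpoint ID; the API key is read from the environment.
ENDPOINT_ID = "your_endpoint_id"
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {
        "prompt": "Write a short story about a robot.",
        "stream": True,
        # Smaller batches -> more, smaller chunks (closer to a typing effect),
        # at the cost of more HTTP calls and lower overall throughput.
        "min_batch_size": 1,
        "batch_size_growth_factor": 2,
        "max_batch_size": 8,
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
print(resp.json())  # returns the job id; poll the endpoint's /stream route for the token batches
```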

Note that this will lower your performance if you plan to serve more than one request at a time. Though the "laggy chunks" might seem slower, the throughput in tokens per second is actually much higher. Performance-wise, the ideal solution is to control, on your side, the speed at which you show the received tokens to the user, which lets you maximize throughput while achieving a typing effect at whatever speed you want.
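A minimal sketch of that client-side pacing (names and the rate here are made up; `chunks` stands for whatever text pieces your client collects from the stream):

```python
import time

def typewriter(chunks, chars_per_second=40):
    """Display already-received text at a steady rate.

    `chunks` is any iterable of text pieces (e.g. the token batches you
    collect from the streaming endpoint). Network speed and display speed
    are decoupled, so large batches still look like typing to the user.
    """
    delay = 1.0 / chars_per_second
    for chunk in chunks:
        for ch in chunk:
            print(ch, end="", flush=True)
            time.sleep(delay)
    print()

# Dummy batches standing in for streamed token chunks:
typewriter(["Hello", ", this arrived in ", "a few large batches."])
```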