How do I set the batch size for vLLM
lambda7xx opened this issue · 3 comments
## Question
- If I run vLLM offline, can I set the batch size? I mean, I want to test its end-to-end latency for different batch sizes.
For offline inference, you can set the maximum batch size using max_num_batched_tokens or max_num_seqs. These parameters can be passed to both the Engine and the LLM class.
(See lines 28 to 29 and lines 151 to 159 in 1a2bbc9.)
These are two different parameters. max_num_batched_tokens dictates how many tokens are processed per forward pass; a single sequence can have many of its tokens running at the same time (for example, during the prefill stage). The number of sequences (max_num_seqs) is probably what you are looking for, which is a bit higher level.
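As a minimal sketch of passing these through the LLM class (the model name and the numeric limits below are just illustrative assumptions):

```python
from vllm import LLM, SamplingParams

# Sketch: cap the scheduler's batch size via engine arguments.
# The model and numbers are placeholders, not recommendations.
llm = LLM(
    model="facebook/opt-125m",
    max_num_seqs=8,               # at most 8 sequences scheduled per step
    max_num_batched_tokens=4096,  # at most 4096 tokens per forward pass
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Hello, my name is"] * 8, sampling_params)
```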
I hope this solves the issue. Feel free to re-open it if this is not resolved!
In the code below, prompts is a list containing 4 prompts. Does that mean the batch size is 4? @simon-mo
```python
from vllm import SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
and it seems that self.max_model_len = prompt length + output token length
Yes, because the maximum number of sequences defaults to 256.
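For the original goal of measuring end-to-end latency at different batch sizes, a rough sketch along these lines could work (the model name, batch sizes, and max_tokens are illustrative assumptions; max_num_seqs is pinned to the batch size so the scheduler never runs more sequences concurrently than intended):

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical latency sweep: cap max_num_seqs at each batch size so all
# requests in the batch are scheduled together.
for batch_size in (1, 4, 16, 64):
    llm = LLM(model="facebook/opt-125m", max_num_seqs=batch_size)
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
    prompts = ["The future of AI is"] * batch_size

    start = time.perf_counter()
    llm.generate(prompts, params)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s e2e")
```

Note that constructing a new LLM inside the loop keeps holding GPU memory until the process exits, so in practice it may be safer to run each batch size in a separate process or script invocation.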