vllm-project/vllm

How do I set the batch size for vLLM

lambda7xx opened this issue · 3 comments

## Question

  • If I run vLLM offline, can I set the batch size? I want to test its e2e latency for different batch sizes.

For offline inference, you can cap the batch size using max_num_batched_tokens or max_num_seqs. These parameters can be passed to both the engine and the LLM class.

```python
max_num_batched_tokens: Optional[int] = None
max_num_seqs: int = 256
```

```python
parser.add_argument('--max-num-batched-tokens',
                    type=int,
                    default=EngineArgs.max_num_batched_tokens,
                    help='maximum number of batched tokens per '
                    'iteration')
parser.add_argument('--max-num-seqs',
                    type=int,
                    default=EngineArgs.max_num_seqs,
                    help='maximum number of sequences per iteration')
```
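
In Python, the same limits can be passed as keyword arguments when constructing the LLM object, since they are forwarded to the engine arguments. A minimal sketch (the model name and the specific values below are just placeholders):

```python
from vllm import LLM

# Minimal sketch: keyword arguments are forwarded to the engine arguments,
# so the same limits exposed as CLI flags can be set here.
# "facebook/opt-125m" and the values below are placeholders.
llm = LLM(
    model="facebook/opt-125m",
    max_num_seqs=8,                 # at most 8 sequences per iteration
    max_num_batched_tokens=4096,    # at most 4096 tokens per forward pass
)
```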

These are two different parameters. max_num_batched_tokens dictates how many tokens run per forward pass; a single sequence can have multiple tokens running at the same time (for example, during the prefill stage). max_num_seqs, the number of sequences per iteration, is probably what you are looking for, since it is a bit higher level.

I hope this solves the issue. Feel free to re-open it if this is not resolved!
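
For the original goal of measuring e2e latency at different batch sizes, one simple approach is to keep the defaults and vary the number of prompts submitted per generate() call, since all of them are scheduled together as long as they fit under max_num_seqs. A rough sketch, using an arbitrary placeholder model and prompt:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model; with the default max_num_seqs=256, every prompt
# submitted in one generate() call is scheduled into the same batch.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
prompt = "Hello, my name is"

for batch_size in (1, 4, 16, 32):
    start = time.perf_counter()
    llm.generate([prompt] * batch_size, sampling_params)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.2f}s e2e")
```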

In the code below, prompts is a list with 4 prompts. Does that mean the batch size is 4? @simon-mo

```python
from vllm import SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```

Also, it seems that self.max_model_len = prompt length + output token length.

Yes, because max_num_seqs defaults to 256, so all 4 prompts fit into a single batch.
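
On the max_model_len observation: it is the model's context length, so it acts as an upper bound on prompt tokens plus generated tokens rather than being equal to that sum. A hedged sketch of how the two interact (placeholder model and values):

```python
from vllm import LLM, SamplingParams

# Placeholder model; OPT-125m has a 2048-token context window, so
# max_model_len defaults to 2048 here.
llm = LLM(model="facebook/opt-125m")

# Generation stops once a sequence reaches max_model_len, so in effect
# prompt tokens + generated tokens <= max_model_len (and max_tokens caps
# the generated part).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```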