vllm-project/vllm

How do I set the batch size for vLLM

lambda7xx opened this issue · 3 comments

## Question

  • If I run vLLM offline, can I set the batch size? I want to test its e2e latency for different batch sizes.

For offline inference, you can cap the batch size using max_num_batched_tokens or max_num_seqs. These parameters can be passed to both the engine and the LLM class.

```python
max_num_batched_tokens: Optional[int] = None
max_num_seqs: int = 256
```

```python
parser.add_argument('--max-num-batched-tokens',
                    type=int,
                    default=EngineArgs.max_num_batched_tokens,
                    help='maximum number of batched tokens per '
                    'iteration')
parser.add_argument('--max-num-seqs',
                    type=int,
                    default=EngineArgs.max_num_seqs,
                    help='maximum number of sequences per iteration')
```
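
In Python, the same limits can be passed as keyword arguments when constructing the LLM object, since they are forwarded to the engine arguments. A minimal sketch (the model name and the specific values below are just placeholders):

```python
from vllm import LLM

# Minimal sketch: keyword arguments are forwarded to the engine arguments,
# so the same limits exposed as CLI flags can be set here.
# "facebook/opt-125m" and the values below are placeholders.
llm = LLM(
    model="facebook/opt-125m",
    max_num_seqs=8,                 # at most 8 sequences per iteration
    max_num_batched_tokens=4096,    # at most 4096 tokens per forward pass
)
```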

These are two different parameters. max_num_batched_tokens dictates how many tokens run per forward pass; a single sequence can have multiple tokens running at the same time (for example, during the prefill stage). max_num_seqs, the number of sequences per iteration, is probably what you are looking for, since it is a bit higher level.

I hope this solves the issue. Feel free to re-open it if this is not resolved!
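
For the original goal of measuring e2e latency at different batch sizes, one simple approach is to keep the defaults and vary the number of prompts submitted per generate() call, since all of them are scheduled together as long as they fit under max_num_seqs. A rough sketch, using an arbitrary placeholder model and prompt:

```python
import time

from vllm import LLM, SamplingParams

# Placeholder model; with the default max_num_seqs=256, every prompt
# submitted in one generate() call is scheduled into the same batch.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
prompt = "Hello, my name is"

for batch_size in (1, 4, 16, 32):
    start = time.perf_counter()
    llm.generate([prompt] * batch_size, sampling_params)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.2f}s e2e")
```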

In the code below, prompts is a list with 4 prompts. Does that mean the batch size is 4? @simon-mo

```python
from vllm import SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```

Also, it seems that self.max_model_len = prompt length + output token length.

Yes, because max_num_seqs defaults to 256, so all 4 prompts fit into a single batch.
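
On the max_model_len observation: it is the model's context length, so it acts as an upper bound on prompt tokens plus generated tokens rather than being equal to that sum. A hedged sketch of how the two interact (placeholder model and values):

```python
from vllm import LLM, SamplingParams

# Placeholder model; OPT-125m has a 2048-token context window, so
# max_model_len defaults to 2048 here.
llm = LLM(model="facebook/opt-125m")

# Generation stops once a sequence reaches max_model_len, so in effect
# prompt tokens + generated tokens <= max_model_len (and max_tokens caps
# the generated part).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```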