A comparison with vLLM 0.2.7
System Info
Ubuntu 22.04
One NVIDIA A800
Driver: 470.141.10
CUDA: 12.3
TensorRT: 9.2.0.5
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Set concurrency = 16.
With vLLM 0.2.7, using the vLLM AsyncEngine:
min response time: 367 ms
max response time: 676 ms
I built TensorRT-LLM and tensorrtllm_backend from the main branch and deployed the model with Triton Server. Test results:
min response time: 379 ms
max response time: 4418 ms
Why is TensorRT-LLM so much slower than vLLM 0.2.7? I expected TensorRT-LLM to be faster. Any suggestions?
Expected behavior
NULL
actual behavior
NULL
additional notes
NULL
@Coder-nlper Please share the commands you used to build the engines and run the benchmarks so that we can check whether the comparison is apples-to-apples. Thanks.
commands to build:
hf_model_path=/root/chatglm3-6b/
engine_dir=/root/trtllm/trtllmmodels/fp16/
CUDA_ID="1"
CUDA_VISIBLE_DEVICES=$CUDA_ID python3 build.py \
    --model_dir $hf_model_path \
    --log_level "info" \
    --output_dir $engine_dir/1-gpu \
    --world_size 1 \
    --tp_size 1 \
    --max_batch_size 50 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_beam_width 1 \
    --enable_context_fmha \
    --use_inflight_batching \
    --paged_kv_cache \
    --remove_input_padding
Test using the ab command:
ip=localhost
port=60025
ab -n 64 -c 16 -p "post.txt" -T "application/json" -H "Content-Type: application/json" -H "Cache-Control: no-cache" -H "Postman-Token: d6xxs-sdf-sdf09d" "http://$ip:$port/model/generate"
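A min/max pair like the one above hides the shape of the latency distribution: under concurrency 16, one queued request can blow out the max while most requests are fine. ab prints its own percentage table, but if you collect per-request timings yourself, a small summary like this makes the tail visible (the function name and the sample numbers below are illustrative, not from the actual run):

```python
import math

def latency_summary(samples_ms):
    """Summarize per-request latencies (ms): min, max, p50, p95.

    Percentiles (nearest-rank method) separate typical and tail
    latency, which a bare min/max pair cannot do.
    """
    if not samples_ms:
        raise ValueError("no samples")
    s = sorted(samples_ms)

    def pct(p):
        # nearest-rank percentile: index of ceil(p% of n), 0-based
        k = max(0, math.ceil(p / 100 * len(s)) - 1)
        return s[k]

    return {"min": s[0], "max": s[-1], "p50": pct(50), "p95": pct(95)}

# Made-up timings in the same range as the report above:
summary = latency_summary([379, 420, 515, 640, 900, 1200, 2300, 4418])
```

If p95 is close to p50 but the max is far above both, the slowdown is a few outlier requests (e.g. scheduling or queuing) rather than uniformly slower inference.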
The driver is 470.141.10. Could that be related?
I have to imagine that likely isn't ideal.
While it's supported per the NVIDIA Frameworks Support Matrix, you may want to try comparing against driver 535 for the current 23.10-based containers. Also be aware that main has moved to 23.12, which would be driver 545 (CUDA 12.3).
But the latest driver is 535.x.x.

The stable driver version is 535, the dev version is 545, and the beta version is 550.
refer link
Hi @white-wolf-tech, do you still have any further issues or questions? If not, we'll close this soon.