Can't reproduce benchmark results
I ran the following benchmark scripts:
benchmark_configs/vllm_variable_size
benchmark_configs/vllm_variable_size_latency
The results I got deviate from the ones published in the blog.
The throughput results are between 6% and 14% lower than the expected ones:
Throughput | max tokens 32 | max tokens 128 | max tokens 512 | max tokens 1536 |
---|---|---|---|---|
Expected | 6121 | 3592 | 2029 | 1898 |
Actual | 5752 | 3180 | 1734 | 1653 |
For `qps=1` the latency is the same, but for `qps=4` it's 54% worse.
Latency | qps 1 | qps 4 |
---|---|---|
Expected | 3.6 | 4.6 |
Actual | 3.6 | 7.1 |
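For reference, the percentages above can be re-derived from the tables with a quick back-of-envelope script (my own sketch; the "expected" numbers are the blog figures quoted above):

```python
# Sanity-check the deviations quoted in this issue. The numbers are copied
# from the tables above; the helper itself is my own sketch.

def pct_worse(expected: float, actual: float) -> float:
    """Percentage by which `actual` deviates from `expected`."""
    return 100.0 * abs(expected - actual) / expected

# Throughput: expected vs. actual tokens/s per max-tokens setting.
throughput = {32: (6121, 5752), 128: (3592, 3180),
              512: (2029, 1734), 1536: (1898, 1653)}
for max_tokens, (exp, act) in throughput.items():
    print(f"max tokens {max_tokens}: {pct_worse(exp, act):.1f}% lower")

# Latency at qps=4: actual 7.1 s vs. expected 4.6 s.
print(f"qps=4 latency: {pct_worse(4.6, 7.1):.1f}% worse")
```

This prints throughput drops of roughly 6% to 14.5% and a ~54% latency regression, matching the claims above.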
Setup details:
Linux 5.10.176+ #1 SMP Sat May 6 15:10:33 UTC 2023 x86_64 GNU/Linux
NVIDIA Driver Version: 525.105.17
GPU: 1 x NVIDIA A100-SXM4-40GB
Python 3.10.6
vLLM 0.1.2
CUDA 11.8
Torch 2.0.1+cu118
Transformers 4.30.1
Can you explain what might cause the performance difference?
Note: I had to fix bad imports in `launch_scripts/launch_vllm` to make it work (for example `ServerArgs` => `EngineArgs`).
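For anyone hitting the same thing: the fix was a plain rename. A minimal, hypothetical patch helper (`fix_imports` is my own name; the only fact taken from the thread is the `ServerArgs` => `EngineArgs` rename):

```python
# Hypothetical one-off patcher for the rename mentioned above
# (ServerArgs -> EngineArgs in newer vLLM 0.1.x); `fix_imports` is my own helper.
from pathlib import Path


def fix_imports(source: str) -> str:
    # The class was renamed, so a plain textual substitution is enough here.
    return source.replace("ServerArgs", "EngineArgs")


if __name__ == "__main__":
    script = Path("launch_scripts/launch_vllm")
    script.write_text(fix_imports(script.read_text()))
```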
Below are the detailed results (of my runs):
```
vllm_range_32_2023-08-02_19:08:59.log:
backend vLLM dur_s 91.73 tokens_per_s 5751.92 qps 10.90 successful_responses 1000 prompt_token_count 512000 response_token_count 15610, median_token_latency=3.129498200757163, median_e2e_latency=49.0919646024704

vllm_range_128_2023-08-02_19:11:07.log:
backend vLLM dur_s 178.07 tokens_per_s 3180.34 qps 5.62 successful_responses 1000 prompt_token_count 512000 response_token_count 54331, median_token_latency=1.7179552376270295, median_e2e_latency=90.2003003358841

vllm_range_512_2023-08-02_19:14:42.log:
backend vLLM dur_s 364.71 tokens_per_s 1738.68 qps 2.74 successful_responses 1000 prompt_token_count 512000 response_token_count 122108, median_token_latency=1.7062032730021375, median_e2e_latency=181.4516226053238

vllm_range_1536_2023-08-02_19:21:23.log:
backend vLLM dur_s 387.94 tokens_per_s 1653.79 qps 2.58 successful_responses 1000 prompt_token_count 512000 response_token_count 129570, median_token_latency=1.7450935804206906, median_e2e_latency=193.9225560426712

vllm_qps_1_numprompts_5000_range_1536_2023-08-02_22:22:20.log:
backend vLLM dur_s 5024.24 tokens_per_s 382.84 qps 1.00 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0408235232035319, median_e2e_latency=3.566364049911499

vllm_qps_4_numprompts_5000_range_1536_2023-08-02_23:46:54.log:
backend vLLM dur_s 1279.33 tokens_per_s 1503.52 qps 3.91 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0629778996757839, median_e2e_latency=7.079227566719055

vllm_qps_8_numprompts_5000_range_1536_2023-08-02_19:30:41.log:
backend vLLM dur_s 1180.98 tokens_per_s 1628.74 qps 4.23 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=2.7175515592098236, median_e2e_latency=267.9675291776657

vllm_qps_16_numprompts_5000_range_1536_2023-08-02_19:51:12.log:
backend vLLM dur_s 1175.71 tokens_per_s 1636.03 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=4.252413267313048, median_e2e_latency=419.6503413915634

vllm_qps_32_numprompts_5000_range_1536_2023-08-02_20:11:37.log:
backend vLLM dur_s 1175.90 tokens_per_s 1635.77 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=5.0277330653612005, median_e2e_latency=496.7233476638794
```
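In case anyone wants to diff these against their own runs, here is a small parser sketch for the log lines above (the field layout is copied from these logs; the helper itself is hypothetical):

```python
# Hypothetical parser for the "backend vLLM dur_s ... tokens_per_s ..." lines
# above; the space-separated field layout is copied from this issue's logs.
import re

_FIELD = re.compile(r"(dur_s|tokens_per_s|qps) (\d+(?:\.\d+)?)")


def parse_benchmark_line(line: str) -> dict:
    """Extract the numeric dur_s / tokens_per_s / qps fields as floats."""
    return {name: float(value) for name, value in _FIELD.findall(line)}
```

For example, `parse_benchmark_line` on the first log line above yields `{'dur_s': 91.73, 'tokens_per_s': 5751.92, 'qps': 10.9}`.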
Hi! We ran on the June 19th version. I believe the newer versions auto-configure the number of GPU blocks available to vLLM, up to 90% of GPU memory. Can you share how many GPU blocks / size of GPU block are present when you run OPT-13B on the A100-40GB?
Also, we found that the number of GPUs vLLM has access to can impact throughput. Is there only one GPU on your machine?
I'm using a single A100-40GB GPU. I'm running on GCP, so I imagine the physical machine itself has more than one such GPU, but my VM has only one at its disposal.
When running the vLLM server, I get the following:

```
# GPU blocks: 866, # CPU blocks: 327
```

I believe the `block_size` used is the default one in version 0.1.2, which is 16.
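As a rough cross-check (my own back-of-envelope, not taken from the vLLM source): with `block_size=16`, the block count bounds how many tokens of KV cache fit on the GPU, and for OPT-13B (40 layers, hidden size 5120, fp16 — my assumptions) each block works out to about 12.5 MiB:

```python
# Back-of-envelope KV-cache sizing for the numbers in this thread.
# Model shape (OPT-13B: 40 layers, hidden size 5120) and fp16 (2 bytes)
# are assumptions on my part; the block counts come from the logs above.

BLOCK_SIZE = 16     # tokens per block (vLLM 0.1.2 default, per this thread)
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
DTYPE_BYTES = 2     # fp16


def cacheable_tokens(num_gpu_blocks: int) -> int:
    # Each block stores BLOCK_SIZE tokens' worth of keys and values.
    return num_gpu_blocks * BLOCK_SIZE


def bytes_per_block() -> int:
    # 2x for keys *and* values, per layer, per hidden dim, per token in block.
    return 2 * NUM_LAYERS * HIDDEN_SIZE * DTYPE_BYTES * BLOCK_SIZE


print(cacheable_tokens(866))       # A100-40GB run from this thread
print(cacheable_tokens(3797))      # A100-80GB, --gpu-memory-utilization=0.9
print(bytes_per_block() / 2**20)   # MiB per block
```

Under those assumptions, 866 blocks is only ~13.9k cacheable tokens (~10.6 GiB of KV cache), versus ~60.8k tokens for the 3797-block 80GB run, which would line up with the latency gap discussed below.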
If it helps, I ran the `qps=4` latency benchmark on a single A100-80GB GPU (instead of a 40GB one) with `--swap-space 0` and varying `--gpu-memory-utilization`, and I got the following results:

- median e2e latency 4.8 when using `--gpu-memory-utilization=0.9` (`# GPU blocks: 3797, # CPU blocks: 0`).
- median e2e latency 5.3 when using `--gpu-memory-utilization=0.45` (to "resemble" the 40GB run - `# GPU blocks: 879, # CPU blocks: 0`).

Just a reminder: with the single A100-40GB I got a median e2e latency of 7.1 (for `qps=4`).
Hey, I got the same import problems. Could you please tell me what I should do about `cannot import name 'LLMServer' from 'vllm'`? Thanks.