Can't reproduce benchmark results
I ran the following benchmark scripts:
benchmark_configs/vllm_variable_size
benchmark_configs/vllm_variable_size_latency
The results I got deviate from the ones published in the blog.
The throughput results are between 6% and 14% lower than the expected ones:
Throughput | max tokens 32 | max tokens 128 | max tokens 512 | max tokens 1536 |
---|---|---|---|---|
Expected | 6121 | 3592 | 2029 | 1898 |
Actual | 5752 | 3180 | 1734 | 1653 |
For `qps=1` the latency is the same, but for `qps=4` it's 54% worse.
Latency | qps 1 | qps 4 |
---|---|---|
Expected | 3.6 | 4.6 |
Actual | 3.6 | 7.1 |
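For reference, the percentages above can be re-derived from the tables with a quick back-of-envelope script (my own sketch; the "expected" numbers are the blog figures quoted above):

```python
# Sanity-check the deviations quoted in this issue. The numbers are copied
# from the tables above; the helper itself is my own sketch.

def pct_worse(expected: float, actual: float) -> float:
    """Percentage by which `actual` deviates from `expected`."""
    return 100.0 * abs(expected - actual) / expected

# Throughput: expected vs. actual tokens/s per max-tokens setting.
throughput = {32: (6121, 5752), 128: (3592, 3180),
              512: (2029, 1734), 1536: (1898, 1653)}
for max_tokens, (exp, act) in throughput.items():
    print(f"max tokens {max_tokens}: {pct_worse(exp, act):.1f}% lower")

# Latency at qps=4: actual 7.1 s vs. expected 4.6 s.
print(f"qps=4 latency: {pct_worse(4.6, 7.1):.1f}% worse")
```

This prints throughput drops of roughly 6% to 14.5% and a ~54% latency regression, matching the claims above.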
Setup details:
Linux 5.10.176+ #1 SMP Sat May 6 15:10:33 UTC 2023 x86_64 GNU/Linux
NVIDIA Driver Version: 525.105.17
GPU: 1 x NVIDIA A100-SXM4-40GB
Python 3.10.6
vLLM 0.1.2
CUDA 11.8
Torch 2.0.1+cu118
Transformers 4.30.1
Can you explain what might cause the performance difference?
Note: I had to fix bad imports in `launch_scripts/launch_vllm` to make it work (for example `ServerArgs` => `EngineArgs`).
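For anyone hitting the same thing: the fix was a plain rename. A minimal, hypothetical patch helper (`fix_imports` is my own name; the only fact taken from the thread is the `ServerArgs` => `EngineArgs` rename):

```python
# Hypothetical one-off patcher for the rename mentioned above
# (ServerArgs -> EngineArgs in newer vLLM 0.1.x); `fix_imports` is my own helper.
from pathlib import Path


def fix_imports(source: str) -> str:
    # The class was renamed, so a plain textual substitution is enough here.
    return source.replace("ServerArgs", "EngineArgs")


if __name__ == "__main__":
    script = Path("launch_scripts/launch_vllm")
    script.write_text(fix_imports(script.read_text()))
```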
Below are the detailed results (of my runs):
```
vllm_range_32_2023-08-02_19:08:59.log:
backend vLLM dur_s 91.73 tokens_per_s 5751.92 qps 10.90 successful_responses 1000 prompt_token_count 512000 response_token_count 15610, median_token_latency=3.129498200757163, median_e2e_latency=49.0919646024704

vllm_range_128_2023-08-02_19:11:07.log:
backend vLLM dur_s 178.07 tokens_per_s 3180.34 qps 5.62 successful_responses 1000 prompt_token_count 512000 response_token_count 54331, median_token_latency=1.7179552376270295, median_e2e_latency=90.2003003358841

vllm_range_512_2023-08-02_19:14:42.log:
backend vLLM dur_s 364.71 tokens_per_s 1738.68 qps 2.74 successful_responses 1000 prompt_token_count 512000 response_token_count 122108, median_token_latency=1.7062032730021375, median_e2e_latency=181.4516226053238

vllm_range_1536_2023-08-02_19:21:23.log:
backend vLLM dur_s 387.94 tokens_per_s 1653.79 qps 2.58 successful_responses 1000 prompt_token_count 512000 response_token_count 129570, median_token_latency=1.7450935804206906, median_e2e_latency=193.9225560426712

vllm_qps_1_numprompts_5000_range_1536_2023-08-02_22:22:20.log:
backend vLLM dur_s 5024.24 tokens_per_s 382.84 qps 1.00 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0408235232035319, median_e2e_latency=3.566364049911499

vllm_qps_4_numprompts_5000_range_1536_2023-08-02_23:46:54.log:
backend vLLM dur_s 1279.33 tokens_per_s 1503.52 qps 3.91 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0629778996757839, median_e2e_latency=7.079227566719055

vllm_qps_8_numprompts_5000_range_1536_2023-08-02_19:30:41.log:
backend vLLM dur_s 1180.98 tokens_per_s 1628.74 qps 4.23 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=2.7175515592098236, median_e2e_latency=267.9675291776657

vllm_qps_16_numprompts_5000_range_1536_2023-08-02_19:51:12.log:
backend vLLM dur_s 1175.71 tokens_per_s 1636.03 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=4.252413267313048, median_e2e_latency=419.6503413915634

vllm_qps_32_numprompts_5000_range_1536_2023-08-02_20:11:37.log:
backend vLLM dur_s 1175.90 tokens_per_s 1635.77 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=5.0277330653612005, median_e2e_latency=496.7233476638794
```
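In case anyone wants to diff these against their own runs, here is a small parser sketch for the log lines above (the field layout is copied from these logs; the helper itself is hypothetical):

```python
# Hypothetical parser for the "backend vLLM dur_s ... tokens_per_s ..." lines
# above; the space-separated field layout is copied from this issue's logs.
import re

_FIELD = re.compile(r"(dur_s|tokens_per_s|qps) (\d+(?:\.\d+)?)")


def parse_benchmark_line(line: str) -> dict:
    """Extract the numeric dur_s / tokens_per_s / qps fields as floats."""
    return {name: float(value) for name, value in _FIELD.findall(line)}
```

For example, `parse_benchmark_line` on the first log line above yields `{'dur_s': 91.73, 'tokens_per_s': 5751.92, 'qps': 10.9}`.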
Hi! We ran on the June 19th version. I believe the newer versions auto-configure the number of GPU blocks available to vLLM, up to 90% of GPU memory. Can you share how many GPU blocks / size of GPU block are present when you run OPT-13B on the A100-40GB?
Also, we found that the number of GPUs vLLM has access to can impact throughput. Is there only one GPU on your machine?
I'm using a single A100-40GB GPU. I'm running on GCP, so I imagine the physical machine itself has more than one such GPU, but my VM has only one at its disposal.
When running the vLLM server, I get the following:

```
# GPU blocks: 866, # CPU blocks: 327
```

I believe the `block_size` used is the default one in version 0.1.2, which is 16.
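As a rough cross-check (my own back-of-envelope, not taken from the vLLM source): with `block_size=16`, the block count bounds how many tokens of KV cache fit on the GPU, and for OPT-13B (40 layers, hidden size 5120, fp16 — my assumptions) each block works out to about 12.5 MiB:

```python
# Back-of-envelope KV-cache sizing for the numbers in this thread.
# Model shape (OPT-13B: 40 layers, hidden size 5120) and fp16 (2 bytes)
# are assumptions on my part; the block counts come from the logs above.

BLOCK_SIZE = 16     # tokens per block (vLLM 0.1.2 default, per this thread)
NUM_LAYERS = 40
HIDDEN_SIZE = 5120
DTYPE_BYTES = 2     # fp16


def cacheable_tokens(num_gpu_blocks: int) -> int:
    # Each block stores BLOCK_SIZE tokens' worth of keys and values.
    return num_gpu_blocks * BLOCK_SIZE


def bytes_per_block() -> int:
    # 2x for keys *and* values, per layer, per hidden dim, per token in block.
    return 2 * NUM_LAYERS * HIDDEN_SIZE * DTYPE_BYTES * BLOCK_SIZE


print(cacheable_tokens(866))       # A100-40GB run from this thread
print(cacheable_tokens(3797))      # A100-80GB, --gpu-memory-utilization=0.9
print(bytes_per_block() / 2**20)   # MiB per block
```

Under those assumptions, 866 blocks is only ~13.9k cacheable tokens (~10.6 GiB of KV cache), versus ~60.8k tokens for the 3797-block 80GB run, which would line up with the latency gap discussed below.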
If it helps, I ran the `qps=4` latency benchmark on a single A100-80GB GPU (instead of a 40GB one) with `--swap-space 0` and varying `--gpu-memory-utilization`, and I got the following results:

- median e2e latency 4.8 when using `--gpu-memory-utilization=0.9` (`# GPU blocks: 3797, # CPU blocks: 0`).
- median e2e latency 5.3 when using `--gpu-memory-utilization=0.45` (to "resemble" the 40GB run - `# GPU blocks: 879, # CPU blocks: 0`).

Just a reminder: with the single A100-40GB I got a median e2e latency of 7.1 (for `qps=4`).
Hey, I got the same import problems. Could you please tell me what I should do about `cannot import name 'LLMServer' from 'vllm'`? Thanks.