anyscale/llm-continuous-batching-benchmarks

Can't reproduce benchmark results


I ran the following benchmark scripts:

  1. benchmark_configs/vllm_variable_size
  2. benchmark_configs/vllm_variable_size_latency

The results I got deviate from the ones published in the blog.
The throughput results are between 6% and 14% lower than the expected ones:

Throughput   max tokens 32   max tokens 128   max tokens 512   max tokens 1536
Expected     6121            3592             2029             1898
Actual       5752            3180             1734             1653

For qps=1 the median e2e latency matches the published one, but for qps=4 it is about 54% higher.

Latency    qps 1   qps 4
Expected   3.6     4.6
Actual     3.6     7.1

Setup details:

Linux 5.10.176+ #1 SMP Sat May 6 15:10:33 UTC 2023 x86_64 GNU/Linux
NVIDIA Driver Version: 525.105.17
GPU: 1 x NVIDIA A100-SXM4-40GB
Python 3.10.6
vLLM 0.1.2
CUDA 11.8
Torch 2.0.1+cu118
Transformers 4.30.1

Can you explain what might cause the performance difference?

Note: I had to fix broken imports in launch_scripts/launch_vllm to get it running (for example, ServerArgs => EngineArgs).
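For reference, the fix amounts to swapping vLLM's old Server classes for their Engine counterparts. A rough sketch of the corrected imports against vLLM 0.1.2 (the exact set of names used in launch_scripts/launch_vllm may differ, and the mapping beyond ServerArgs => EngineArgs is my assumption):

    # Assumed Server -> Engine rename mapping (only ServerArgs => EngineArgs is confirmed above):
    #   ServerArgs      -> EngineArgs
    #   AsyncServerArgs -> AsyncEngineArgs
    #   LLMServer       -> LLMEngine
    #   AsyncLLMServer  -> AsyncLLMEngine
    from vllm import AsyncEngineArgs, AsyncLLMEngine

    # Build the async engine from engine args (mirrors vLLM 0.1.2's own api_server entrypoint).
    engine_args = AsyncEngineArgs(model="facebook/opt-13b")  # OPT-13B, the model used in the blog
    engine = AsyncLLMEngine.from_engine_args(engine_args)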

Below are the detailed results (of my runs):

vllm_range_32_2023-08-02_19:08:59.log:
backend vLLM dur_s 91.73 tokens_per_s 5751.92 qps 10.90 successful_responses 1000 prompt_token_count 512000 response_token_count 15610, median_token_latency=3.129498200757163, median_e2e_latency=49.0919646024704

vllm_range_128_2023-08-02_19:11:07.log:
backend vLLM dur_s 178.07 tokens_per_s 3180.34 qps 5.62 successful_responses 1000 prompt_token_count 512000 response_token_count 54331, median_token_latency=1.7179552376270295, median_e2e_latency=90.2003003358841

vllm_range_512_2023-08-02_19:14:42.log:
backend vLLM dur_s 364.71 tokens_per_s 1738.68 qps 2.74 successful_responses 1000 prompt_token_count 512000 response_token_count 122108, median_token_latency=1.7062032730021375, median_e2e_latency=181.4516226053238

vllm_range_1536_2023-08-02_19:21:23.log:
backend vLLM dur_s 387.94 tokens_per_s 1653.79 qps 2.58 successful_responses 1000 prompt_token_count 512000 response_token_count 129570, median_token_latency=1.7450935804206906, median_e2e_latency=193.9225560426712

vllm_qps_1_numprompts_5000_range_1536_2023-08-02_22:22:20.log:
backend vLLM dur_s 5024.24 tokens_per_s 382.84 qps 1.00 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0408235232035319, median_e2e_latency=3.566364049911499

vllm_qps_4_numprompts_5000_range_1536_2023-08-02_23:46:54.log:
backend vLLM dur_s 1279.33 tokens_per_s 1503.52 qps 3.91 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=0.0629778996757839, median_e2e_latency=7.079227566719055

vllm_qps_8_numprompts_5000_range_1536_2023-08-02_19:30:41.log:
backend vLLM dur_s 1180.98 tokens_per_s 1628.74 qps 4.23 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=2.7175515592098236, median_e2e_latency=267.9675291776657

vllm_qps_16_numprompts_5000_range_1536_2023-08-02_19:51:12.log:
backend vLLM dur_s 1175.71 tokens_per_s 1636.03 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=4.252413267313048, median_e2e_latency=419.6503413915634

vllm_qps_32_numprompts_5000_range_1536_2023-08-02_20:11:37.log:
backend vLLM dur_s 1175.90 tokens_per_s 1635.77 qps 4.25 successful_responses 5000 prompt_token_count 1290681 response_token_count 632820, median_token_latency=5.0277330653612005, median_e2e_latency=496.7233476638794

Hi! We ran on the June 19th version. I believe the newer versions auto-configure the number of GPU blocks available to vLLM, up to 90% of GPU memory. Can you share how many GPU blocks there are, and of what block size, when you run OPT-13B on the A100-40GB?

Also, we found that the number of GPUs vLLM has access to can impact throughput. Is there only one GPU on your machine?
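(For instance, something like the snippet below should show how many devices the benchmark process can actually see, assuming the Torch install from the setup above.)

    import torch

    # Number of CUDA devices visible to the process (respects CUDA_VISIBLE_DEVICES).
    print(torch.cuda.device_count())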

I'm using a single A100-40GB GPU. I'm running on GCP, so I imagine the physical machine itself has more than one such GPU, but my VM has only one at its disposal.
When running the vLLM server, I get the following:

# GPU blocks: 866, # CPU blocks: 327

I believe the block_size used is the default one in version 0.1.2, which is 16.
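For context, a quick back-of-the-envelope on what those numbers imply for the KV cache (a sketch assuming block_size=16, as stated above):

    # KV-cache capacity implied by the reported block counts, assuming 16 tokens per block.
    block_size = 16
    gpu_blocks = 866                      # "# GPU blocks: 866" from the server log above
    kv_capacity = block_size * gpu_blocks
    print(kv_capacity)                    # 13856 tokens of KV cache on the A100-40GB

    # Each prompt in the throughput runs is 512 tokens (512000 prompt tokens / 1000 requests),
    # so roughly 27 requests' prompts fit in the cache at once, before counting generated tokens.
    print(kv_capacity // 512)             # 27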

If it helps, I also ran the qps=4 latency benchmark on a single A100-80GB GPU (instead of the 40GB one) with --swap-space 0 and varying --gpu-memory-utilization, and got the following results:

  • median e2e latency of 4.8 s when using --gpu-memory-utilization=0.9 (# GPU blocks: 3797, # CPU blocks: 0).
  • median e2e latency of 5.3 s when using --gpu-memory-utilization=0.45 (to "resemble" the 40GB run; # GPU blocks: 879, # CPU blocks: 0).

As a reminder, with the single A100-40GB GPU I got a median e2e latency of 7.1 s (for qps=4).
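For completeness, a sketch of the engine configuration those flags correspond to, using vLLM 0.1.2's Python API (the CLI flags map to EngineArgs fields with underscores; the OPT-13B model name is taken from the comment above):

    from vllm import EngineArgs, LLMEngine

    # A100-80GB run meant to mimic the 40GB card: no CPU swap space,
    # GPU memory utilization cut to 0.45.
    engine_args = EngineArgs(
        model="facebook/opt-13b",         # OPT-13B, per the comment above
        swap_space=0,                     # --swap-space 0                -> "# CPU blocks: 0"
        gpu_memory_utilization=0.45,      # --gpu-memory-utilization=0.45 -> "# GPU blocks: 879"
    )
    engine = LLMEngine.from_engine_args(engine_args)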


Hey, I ran into the same import problems. Could you please tell me what I should do about "cannot import name 'LLMServer' from 'vllm'"? Thanks.