huggingface/tgi-gaudi

TGI Performance script run_generation.py missing Throughput Info

jingkang99 opened this issue · 3 comments

System Info

Ubuntu 22.04
ghcr.io/huggingface/tgi-gaudi 2.0.0

cd tgi-gaudi/examples
python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 50 --max_input_length 200 --max_output_length 200 --total_sample_count 200

100%|████████████| 200/200 [01:08<00:00, 2.92it/s]

----- Performance summary -----

Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s

First token latency:
Median: 18783.24ms
Average: 16835.90ms

Output token latency:
Median: 14.22ms
Average: 15.22ms

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

By the way, if

--max_concurrent_requests 50

is not specified, the run fails with the following error:

Thread failed with error: Model is overloaded
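A minimal sketch of why the flag helps, assuming the "Model is overloaded" error comes from the server rejecting requests once its queue is full: capping in-flight requests on the client side with a semaphore, roughly in the spirit of what `--max_concurrent_requests` does in `run_generation.py`. The `send_request` function and the constant are hypothetical placeholders, not the script's actual implementation.

```python
import threading
import time

# Hypothetical cap on in-flight requests; run_generation.py's real default differs.
MAX_CONCURRENT = 4

semaphore = threading.Semaphore(MAX_CONCURRENT)
lock = threading.Lock()
in_flight = 0
peak_in_flight = 0

def send_request(i: int) -> None:
    """Placeholder for one generation request to the TGI server."""
    global in_flight, peak_in_flight
    with semaphore:                      # blocks once MAX_CONCURRENT are active
        with lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.01)                 # stand-in for the HTTP round trip
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=send_request, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak_in_flight)  # never exceeds MAX_CONCURRENT
```

Without such a cap, all 200 sample requests would be fired at once, which is what pushes the server past its queue limit.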

Expected behavior

Throughput should be calculated and displayed instead of:

Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s
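The arithmetic behind the summary is simple: totals divided by wall-clock duration. A hedged sketch (function name and numbers are illustrative, not the script's actual code) showing why the figure reads 0.0 whenever the per-request token counts are never accumulated:

```python
# Hypothetical sketch of the throughput arithmetic the summary reports.
def throughput(total_new_tokens: int, total_queries: int, duration_s: float):
    """tokens/s and queries/s as totals over wall-clock time."""
    return total_new_tokens / duration_s, total_queries / duration_s

# Illustrative round numbers: ~1000 queries over ~33 minutes.
tok_s, q_s = throughput(total_new_tokens=950_000, total_queries=1000, duration_s=1980.0)
print(f"Throughput: {tok_s:.1f} tokens/s")
print(f"Throughput: {q_s:.1f} queries/s")

# If token counts are missing (the symptom here), the numerator is 0:
zero_tok_s, _ = throughput(total_new_tokens=0, total_queries=1000, duration_s=1980.0)
print(f"Throughput: {zero_tok_s:.1f} tokens/s")  # -> Throughput: 0.0 tokens/s
```

This matches the observed behavior: latency stats are still reported (they are timed client-side), while the token-based throughput collapses to zero.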

@jingkang99 please also share your TGI server launch command

@jingkang99 I think I know what the issue is in your case -- please check which version of huggingface_hub is installed in your env. The newest version has an issue with details=True in stream mode: huggingface#1876.
To resolve this issue, please install requirements as mentioned in the README.
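For context, the latency figures in the summary can be derived from the per-token arrival timestamps of a streamed response. A minimal sketch with synthetic timestamps; the function name and data shape are illustrative assumptions, not TGI's actual stream schema:

```python
import statistics

# Hypothetical sketch: first-token and inter-token latencies from the
# arrival timestamps (in seconds) of a streamed generation response.
def latency_stats(request_start: float, token_times: list[float]):
    first_token_ms = (token_times[0] - request_start) * 1000.0
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    return first_token_ms, statistics.median(gaps_ms), statistics.mean(gaps_ms)

# One synthetic request: first token after 0.5 s, then one token every 14 ms.
start = 0.0
times = [0.5 + 0.014 * i for i in range(10)]
first_ms, median_ms, mean_ms = latency_stats(start, times)
print(f"First token latency: {first_ms:.2f}ms")
print(f"Output token latency, median: {median_ms:.2f}ms, average: {mean_ms:.2f}ms")
```

The token *counts*, by contrast, come from the `details` payload of the stream, which is what the broken huggingface_hub version drops -- hence latencies survive while throughput reads zero.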

Thanks a lot for your input.

These exact versions MUST be installed:

huggingface_hub==0.20.3
requests==2.31.0
datasets==2.18.0
transformers==4.37.0

Result:
time python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 100 --max_input_length 1000 --max_output_length 1000 --total_sample_count 1000

Filter: 100%|████████| 10331/10331 [00:03<00:00, 3380.24 examples/s]
100%|████████████| 1000/1000 [29:31<00:00, 1.77s/it]

----- Performance summary -----

Throughput: 479.7 tokens/s
Throughput: 0.5 queries/s

First token latency:
Median: 181037.75ms
Average: 173501.17ms

Output token latency:
Median: 14.01ms
Average: 14.51ms

real 33m0.485s
user 4m22.686s
sys 0m49.261s