huggingface/tgi-gaudi

TGI Performance script run_generation.py missing Throughput Info

jingkang99 opened this issue · 3 comments

System Info

Ubuntu 22.04
ghcr.io/huggingface/tgi-gaudi 2.0.0

cd tgi-gaudi/examples
python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 50 --max_input_length 200 --max_output_length 200 --total_sample_count 200

100%|████████████| 200/200 [01:08<00:00, 2.92it/s]

----- Performance summary -----

Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s

First token latency:
Median: 18783.24ms
Average: 16835.90ms

Output token latency:
Median: 14.22ms
Average: 15.22ms

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

By the way, if

--max_concurrent_requests 50

is not specified, the run fails with the following error:

Thread failed with error: Model is overloaded
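A minimal sketch of why the flag helps, assuming the "Model is overloaded" error comes from the server rejecting requests once its queue is full: capping in-flight requests on the client side with a semaphore, roughly in the spirit of what `--max_concurrent_requests` does in `run_generation.py`. The `send_request` function and the constant are hypothetical placeholders, not the script's actual implementation.

```python
import threading
import time

# Hypothetical cap on in-flight requests; run_generation.py's real default differs.
MAX_CONCURRENT = 4

semaphore = threading.Semaphore(MAX_CONCURRENT)
lock = threading.Lock()
in_flight = 0
peak_in_flight = 0

def send_request(i: int) -> None:
    """Placeholder for one generation request to the TGI server."""
    global in_flight, peak_in_flight
    with semaphore:                      # blocks once MAX_CONCURRENT are active
        with lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.01)                 # stand-in for the HTTP round trip
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=send_request, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak_in_flight)  # never exceeds MAX_CONCURRENT
```

Without such a cap, all 200 sample requests would be fired at once, which is what pushes the server past its queue limit.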

Expected behavior

Throughput should be calculated and displayed instead of:

Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s
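The arithmetic behind the summary is simple: totals divided by wall-clock duration. A hedged sketch (function name and numbers are illustrative, not the script's actual code) showing why the figure reads 0.0 whenever the per-request token counts are never accumulated:

```python
# Hypothetical sketch of the throughput arithmetic the summary reports.
def throughput(total_new_tokens: int, total_queries: int, duration_s: float):
    """tokens/s and queries/s as totals over wall-clock time."""
    return total_new_tokens / duration_s, total_queries / duration_s

# Illustrative round numbers: ~1000 queries over ~33 minutes.
tok_s, q_s = throughput(total_new_tokens=950_000, total_queries=1000, duration_s=1980.0)
print(f"Throughput: {tok_s:.1f} tokens/s")
print(f"Throughput: {q_s:.1f} queries/s")

# If token counts are missing (the symptom here), the numerator is 0:
zero_tok_s, _ = throughput(total_new_tokens=0, total_queries=1000, duration_s=1980.0)
print(f"Throughput: {zero_tok_s:.1f} tokens/s")  # -> Throughput: 0.0 tokens/s
```

This matches the observed behavior: latency stats are still reported (they are timed client-side), while the token-based throughput collapses to zero.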

@jingkang99 please also share your TGI server launch command

@jingkang99 I think I know what the issue is in your case -- please check which version of huggingface_hub is installed in your env. The newest version has an issue with details=True in stream mode: huggingface#1876.
To resolve this issue, please install requirements as mentioned in the README.
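For context, the latency figures in the summary can be derived from the per-token arrival timestamps of a streamed response. A minimal sketch with synthetic timestamps; the function name and data shape are illustrative assumptions, not TGI's actual stream schema:

```python
import statistics

# Hypothetical sketch: first-token and inter-token latencies from the
# arrival timestamps (in seconds) of a streamed generation response.
def latency_stats(request_start: float, token_times: list[float]):
    first_token_ms = (token_times[0] - request_start) * 1000.0
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(token_times, token_times[1:])]
    return first_token_ms, statistics.median(gaps_ms), statistics.mean(gaps_ms)

# One synthetic request: first token after 0.5 s, then one token every 14 ms.
start = 0.0
times = [0.5 + 0.014 * i for i in range(10)]
first_ms, median_ms, mean_ms = latency_stats(start, times)
print(f"First token latency: {first_ms:.2f}ms")
print(f"Output token latency, median: {median_ms:.2f}ms, average: {mean_ms:.2f}ms")
```

The token *counts*, by contrast, come from the `details` payload of the stream, which is what the broken huggingface_hub version drops -- hence latencies survive while throughput reads zero.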

Thanks a lot for your input.

These exact versions MUST be installed:

huggingface_hub==0.20.3
requests==2.31.0
datasets==2.18.0
transformers==4.37.0

Result:
time python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 100 --max_input_length 1000 --max_output_length 1000 --total_sample_count 1000

Filter: 100%|████████| 10331/10331 [00:03<00:00, 3380.24 examples/s]
100%|████████████| 1000/1000 [29:31<00:00, 1.77s/it]

----- Performance summary -----

Throughput: 479.7 tokens/s
Throughput: 0.5 queries/s

First token latency:
Median: 181037.75ms
Average: 173501.17ms

Output token latency:
Median: 14.01ms
Average: 14.51ms

real 33m0.485s
user 4m22.686s
sys 0m49.261s