TGI Performance script run_generation.py missing Throughput Info
jingkang99 opened this issue · 3 comments
System Info
Ubuntu 22.04
ghcr.io/huggingface/tgi-gaudi 2.0.0
cd tgi-gaudi/examples
python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 50 --max_input_length 200 --max_output_length 200 --total_sample_count 200
100%|████████████| 200/200 [01:08<00:00, 2.92it/s]
----- Performance summary -----
Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s
First token latency:
Median: 18783.24ms
Average: 16835.90ms
Output token latency:
Median: 14.22ms
Average: 15.22ms
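A benchmark like this typically computes throughput as total generated tokens divided by wall-clock time, so the 0.0 values above point to a token count of zero rather than a timing problem. A minimal sketch of that calculation, with illustrative names rather than the actual run_generation.py internals:

```python
def summarize(token_counts, duration_s):
    # token_counts: generated-token count per completed request
    # (hypothetical structure; the real script's bookkeeping may differ)
    total_tokens = sum(token_counts)
    print("----- Performance summary -----")
    # If the client never receives generation details, every count is 0
    # and both lines print 0.0 even though all requests completed.
    print(f"Throughput: {total_tokens / duration_s:.1f} tokens/s")
    print(f"Throughput: {len(token_counts) / duration_s:.1f} queries/s")

summarize(token_counts=[198, 200, 195], duration_s=68.0)
```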
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
By the way, if --max_concurrent_requests 50 is not specified, the run fails with the following error:
Thread failed with error: Model is overloaded
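For context, the flag caps how many requests the client keeps in flight at once; without a cap, requests pile up past the server's queue limit and TGI rejects the overflow. A rough sketch of that kind of client-side throttle (illustrative only, not the script's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_REQUESTS = 50   # mirrors --max_concurrent_requests 50
prompts = ["Hello"] * 200      # stand-in for the sampled dataset prompts

def send_request(prompt: str) -> str:
    # Hypothetical placeholder for one streaming request to the TGI server.
    return prompt

# Bounding the worker pool keeps at most 50 requests in flight; submitting
# everything unthrottled can exceed the server queue and surface
# "Thread failed with error: Model is overloaded".
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as pool:
    results = list(pool.map(send_request, prompts))
```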
Expected behavior
Throughput info should be calculated and shown, instead of:
Throughput: 0.0 tokens/s
Throughput: 0.0 queries/s
@jingkang99 please also share your TGI server command
@jingkang99 I think I know what the issue is in your case -- please check which version of huggingface_hub
you have installed in your env. The newest version has an issue with details=True
in stream mode: huggingface#1876.
To resolve this issue, please install the requirements as mentioned in the README.
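For reference, the client path in question streams tokens with details enabled; if the installed huggingface_hub version mishandles details=True in stream mode, the per-token metadata the benchmark counts never arrives, leaving the token total at zero. A hedged sketch of that usage with huggingface_hub's InferenceClient (the endpoint URL and the token counting are assumptions here, not the script's exact code):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed local TGI endpoint

generated_tokens = 0
# stream=True with details=True yields one response per generated token;
# the affected huggingface_hub versions break this combination
# (huggingface#1876), so the count stays at 0 and throughput reads 0.0.
for response in client.text_generation(
    "What is deep learning?",
    max_new_tokens=200,
    details=True,
    stream=True,
):
    generated_tokens += 1
print(f"generated {generated_tokens} tokens")
```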
Thanks a lot for your input.
You MUST install these exact versions:
huggingface_hub==0.20.3
requests==2.31.0
datasets==2.18.0
transformers==4.37.0
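For example, with plain pip:
pip install huggingface_hub==0.20.3 requests==2.31.0 datasets==2.18.0 transformers==4.37.0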
Result after pinning the versions:
time python run_generation.py --model_id meta-llama/Llama-2-7b-hf --max_concurrent_requests 100 --max_input_length 1000 --max_output_length 1000 --total_sample_count 1000
Filter: 100%|████████| 10331/10331 [00:03<00:00, 3380.24 examples/s]
100%|████████████| 1000/1000 [29:31<00:00, 1.77s/it]
----- Performance summary -----
Throughput: 479.7 tokens/s
Throughput: 0.5 queries/s
First token latency:
Median: 181037.75ms
Average: 173501.17ms
Output token latency:
Median: 14.01ms
Average: 14.51ms
real 33m0.485s
user 4m22.686s
sys 0m49.261s
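As a sanity check, the numbers are self-consistent: 1000 requests over the 29:31 generation loop (~1771 s) is 1000 / 1771 ≈ 0.56 queries/s (reported as 0.5), and 479.7 tokens/s × 1771 s ≈ 850K generated tokens, i.e. roughly 850 output tokens per request against the --max_output_length 1000 cap.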