Low throughput while using TGI-Gaudi with bigcode/starcoderbase-3b on Gaudi2
vishnumadhu365 opened this issue · 1 comment
vishnumadhu365 commented
System Info
tgi-gaudi docker container built from master branch (4fe871f)
Ubuntu 22.04.3 LTS
Gaudi2
HL-SMI Version: hl-1.15.0-fw-48.2.1.1
Driver Version: 1.15.0-a596ef0
Model : bigcode/starcoderbase-3b
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Steps
- Docker run
docker run -it -p 8080:80 -v $volume:/data --runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e HUGGING_FACE_HUB_TOKEN=1234 \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e ENABLE_HPU_GRAPH=False -e BATCH_BUCKET_SIZE=128 \
-e PREFILL_BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
--cap-add=sys_nice \
--ipc=host tgi-gaudi:latest \
--model-id $model \
--max-input-tokens 568 \
--max-batch-prefill-tokens 618 \
--max-total-tokens 614 \
--max-batch-total-tokens 78592
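For reference, the command above assumes $model and $volume are already set in the shell. A minimal sketch of those variables plus a quick sanity check that the endpoint responds before benchmarking (the variable values and prompt here are illustrative):
model=bigcode/starcoderbase-3b
volume=$PWD/data   # host directory mounted into the container at /data
# verify the server answers before starting the load test
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 46}}'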
- Measure performance of the TGI endpoint with the script in tgi-gaudi/examples
python3 run_generation.py \
--model_id $model \
--server_address http://localhost:8080 \
--max_input_length 568 \
--max_output_length 46 \
--total_sample_count 1280 \
--max_concurrent_requests 128
output:
--------------------------------
----- Performance summary -----
--------------------------------
Throughput: 98.8 tokens/s
Throughput: 2.2 queries/s
--------------------------------
First token latency:
Median: 54734.41ms
Average: 52755.73ms
--------------------------------
Output token latency:
Median: 58.47ms
Average: 69.58ms
--------------------------------
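The same concurrent load can also be generated with plain curl, independent of the example script. A rough sketch (request count and prompt are arbitrary):
# send 5 identical requests in parallel and wait for all of them to return
for i in $(seq 1 5); do
  curl -s http://localhost:8080/generate \
      -X POST \
      -H 'Content-Type: application/json' \
      -d '{"inputs": "def fibonacci(n):", "parameters": {"max_new_tokens": 46}}' > /dev/null &
done
wait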
- Run the static benchmark from within the TGI container
text-generation-benchmark -b 128 -b 64 -b 32 -b 16 -b 8 -b 4 -b 2 -b 1 -s 567 -d 46 -w 5 -r 100 -t bigcode/starcoderbase-3b
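For reproduction, the command above was run from a shell opened inside the already-running container, e.g. (the container name/ID is whatever Docker assigned):
docker exec -it <tgi_container_name_or_id> bash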
Expected behavior
Issue:
Throughput numbers when hitting the TGI endpoint are way off from the static benchmark throughput.
Server logs suggest there is some issue with continuous batching on Gaudi2.
# Testing by sending 5 requests to the Gaudi2 TGI endpoint. Note that the queue time keeps increasing for subsequent inference requests.
Req1: total_time="3.076226394s" validation_time="449.063µs" queue_time="110.028µs" inference_time="3.075667684s" time_per_token="66.86234ms"
Req2: total_time="3.076173218s" validation_time="3.502745ms" queue_time="70.64658ms" inference_time="3.002024052s" time_per_token="65.261392ms"
Req3: total_time="3.132718439s" validation_time="786.778µs" queue_time="201.632982ms" inference_time="2.930298993s" time_per_token="63.702152ms"
Req4: total_time="3.197355097s" validation_time="1.277488ms" queue_time="331.050014ms" inference_time="2.865027991s" time_per_token="62.283217ms"
Req5: total_time="3.259123777s" validation_time="924.292µs" queue_time="459.104331ms" inference_time="2.799095535s" time_per_token="60.849902ms"
# Same test as above, this time sending 5 requests to a single Nvidia T4 card running the TGI 2.0.4 Docker image. Note that the queue time stays more or less constant after the first request, indicating effective continuous batching.
Req1: total_time="1.513475533s" validation_time="1.069695ms" queue_time="52.017µs" inference_time="1.512354236s" time_per_token="32.877266ms"
Req2: total_time="1.507096983s" validation_time="799.031µs" queue_time="54.518157ms" inference_time="1.451780025s" time_per_token="31.560435ms"
Req3: total_time="1.502753387s" validation_time="418.679µs" queue_time="50.525381ms" inference_time="1.451809782s" time_per_token="31.561082ms"
Req4: total_time="1.507244713s" validation_time="841.468µs" queue_time="54.479958ms" inference_time="1.451923498s" time_per_token="31.563554ms"
Req5: total_time="1.503086631s" validation_time="828.972µs" queue_time="50.359691ms" inference_time="1.451898309s" time_per_token="31.563006ms"
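The queue_time values in both traces come from the TGI router request logs; on the Gaudi2 side they can be pulled out of the container logs with something like this (container name assumed):
docker logs <tgi_container_name_or_id> 2>&1 | grep -o 'queue_time="[^"]*"'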
Expected result:
Gaudi2 throughput numbers on the TGI endpoint (with continuous batching) should be on par with or better than the static benchmark throughput.