https://github.com/huggingface/tgi-gaudi/pull/176 causes performance regression for benchmark
mandy-li opened this issue · 3 comments
System Info
Build the Docker image from the latest habana-main branch, then inside the container run:
text-generation-benchmark -t mistralai/Mixtral-8x7B-Instruct-v0.1
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Run the benchmark for Mixtral 8x7B on 2x HPUs.
With this PR:
| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 3.75 tokens/secs | 3.23 tokens/secs | 3.87 tokens/secs |
| | 2 | 7.73 tokens/secs | 7.31 tokens/secs | 8.05 tokens/secs |
| | 4 | 15.85 tokens/secs | 15.64 tokens/secs | 16.06 tokens/secs |
| | 8 | 287.04 tokens/secs | 281.86 tokens/secs | 293.99 tokens/secs |
| | 16 | 510.34 tokens/secs | 481.08 tokens/secs | 547.25 tokens/secs |
| | 32 | 880.87 tokens/secs | 857.41 tokens/secs | 903.95 tokens/secs |
| Decode | 1 | 16.76 tokens/secs | 16.64 tokens/secs | 16.84 tokens/secs |
| | 2 | 33.53 tokens/secs | 32.69 tokens/secs | 34.22 tokens/secs |
| | 4 | 68.43 tokens/secs | 68.16 tokens/secs | 68.84 tokens/secs |
| | 8 | 225.42 tokens/secs | 224.82 tokens/secs | 225.82 tokens/secs |
| | 16 | 404.56 tokens/secs | 401.84 tokens/secs | 407.02 tokens/secs |
| | 32 | 653.55 tokens/secs | 651.50 tokens/secs | 655.44 tokens/secs |
Without this PR:
| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 38.37 tokens/secs | 38.29 tokens/secs | 38.57 tokens/secs |
| | 2 | 77.02 tokens/secs | 76.86 tokens/secs | 77.70 tokens/secs |
| | 4 | 154.88 tokens/secs | 154.75 tokens/secs | 155.11 tokens/secs |
| | 8 | 301.49 tokens/secs | 250.01 tokens/secs | 316.98 tokens/secs |
| | 16 | 592.47 tokens/secs | 544.04 tokens/secs | 607.58 tokens/secs |
| | 32 | 970.91 tokens/secs | 740.86 tokens/secs | 1027.14 tokens/secs |
| Decode | 1 | 28.60 tokens/secs | 28.59 tokens/secs | 28.61 tokens/secs |
| | 2 | 57.17 tokens/secs | 57.13 tokens/secs | 57.19 tokens/secs |
| | 4 | 114.30 tokens/secs | 114.26 tokens/secs | 114.33 tokens/secs |
| | 8 | 224.53 tokens/secs | 223.37 tokens/secs | 228.83 tokens/secs |
| | 16 | 399.37 tokens/secs | 398.59 tokens/secs | 402.06 tokens/secs |
| | 32 | 648.88 tokens/secs | 645.33 tokens/secs | 665.85 tokens/secs |
Expected behavior
No regression.
I'm checking it
I've tried to reproduce it with llama-7b; here are the results:
w/o this PR
| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 279.56 tokens/secs | 216.22 tokens/secs | 311.22 tokens/secs |
| | 2 | 535.36 tokens/secs | 431.27 tokens/secs | 622.92 tokens/secs |
| | 4 | 1072.61 tokens/secs | 803.66 tokens/secs | 1257.65 tokens/secs |
| | 8 | 1994.51 tokens/secs | 1591.15 tokens/secs | 2205.64 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
| Decode | 1 | 46.93 tokens/secs | 46.77 tokens/secs | 47.26 tokens/secs |
| | 2 | 93.94 tokens/secs | 93.28 tokens/secs | 94.54 tokens/secs |
| | 4 | 187.90 tokens/secs | 186.33 tokens/secs | 190.61 tokens/secs |
| | 8 | 374.90 tokens/secs | 373.37 tokens/secs | 377.09 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
w/ this PR
| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 23.03 tokens/secs | 18.56 tokens/secs | 25.15 tokens/secs |
| | 2 | 46.39 tokens/secs | 42.86 tokens/secs | 49.54 tokens/secs |
| | 4 | 92.00 tokens/secs | 67.12 tokens/secs | 98.78 tokens/secs |
| | 8 | 2003.16 tokens/secs | 1704.35 tokens/secs | 2235.91 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
| Decode | 1 | 50.72 tokens/secs | 47.04 tokens/secs | 51.00 tokens/secs |
| | 2 | 101.47 tokens/secs | 98.24 tokens/secs | 102.12 tokens/secs |
| | 4 | 202.55 tokens/secs | 187.25 tokens/secs | 204.26 tokens/secs |
| | 8 | 375.23 tokens/secs | 373.78 tokens/secs | 377.80 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
We can see that prefill performance, especially for the small batch sizes, did drop significantly.
1. Performance is significantly lower for bs < 8. This is caused by a logic flaw in the previous PR's design: it treated `batch.batch_size` as the actual graph input size, which it is not. In the later `recombine` logic, the batch size is rounded up to `BATCH_BUCKET_SIZE`, which is 8 by default. So when the test bs < 8, the graphs are identical and there is no need to clear the cache in this case.
2. Prefill performance is slightly lower even when bs >= 8. To understand this, we need to know the mechanism of this PR: it removes the previous HPU graph when executing a new graph. Once a graph is removed, it has to be re-captured the next time it is used, which costs host time. In this test case the default benchmark input_len is 10 and output_len is 8, a rather small workload, so the fraction of time spent capturing graphs increases, leading to lower performance.
To solve problem 1, I'll change the cache-clear condition to use the rounded batch size.
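A minimal sketch of that idea, assuming hypothetical helper names (`round_up`, `should_clear_graph_cache`) rather than the actual tgi-gaudi code:

```python
import math
import os

# BATCH_BUCKET_SIZE defaults to 8 in tgi-gaudi; reading it from the environment
# here is an assumption made for this sketch.
BATCH_BUCKET_SIZE = int(os.environ.get("BATCH_BUCKET_SIZE", 8))


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple (hypothetical helper)."""
    return math.ceil(value / multiple) * multiple


def should_clear_graph_cache(prev_bs: int, new_bs: int) -> bool:
    """Clear cached HPU graphs only when the *bucketed* batch size changes.

    Since the batch size is rounded up to BATCH_BUCKET_SIZE anyway, raw batch
    sizes below the bucket boundary (e.g. 2 and 4) map to the same captured
    graph, and clearing the cache for them is wasted work.
    """
    return round_up(prev_bs, BATCH_BUCKET_SIZE) != round_up(new_bs, BATCH_BUCKET_SIZE)
```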
To solve problem 2, I'll add a condition so that this feature is enabled only when the user chooses the `--limit_hpu_graphs` option, which does not cache the prefill-stage HPU graph in the first place, since this feature removes the prefill HPU graph immediately in the next decode stage anyway.
So the final change will look like this; you can try it out before the fix PR is issued and merged. @mandy-li
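As a rough illustration only, a sketch of what such a combined change could look like, reusing the hypothetical `round_up` and `BATCH_BUCKET_SIZE` from the sketch above; `clear_hpu_graph_cache` is a stand-in name, not the actual server API:

```python
def maybe_clear_prefill_graph(model,
                              prev_bs: int,
                              new_bs: int,
                              limit_hpu_graphs: bool) -> None:
    """Drop the previously captured prefill HPU graph only when it pays off."""
    if not limit_hpu_graphs:
        # Without --limit_hpu_graphs the prefill graph stays cached as before,
        # so the removal feature is skipped entirely (fix for problem 2).
        return
    if round_up(prev_bs, BATCH_BUCKET_SIZE) == round_up(new_bs, BATCH_BUCKET_SIZE):
        # Same bucketed shape means the same captured graph, so there is
        # nothing to clear (fix for problem 1).
        return
    model.clear_hpu_graph_cache()  # placeholder for the real cache-clearing call
```

The design intent, as described above, is that graph removal only happens when it avoids holding memory the user explicitly opted not to keep, and only when the bucketed shape actually changes, so small-batch runs no longer pay a re-capture cost.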