huggingface/tgi-gaudi

https://github.com/huggingface/tgi-gaudi/pull/176 causes performance regression for benchmark

mandy-li opened this issue · 3 comments

System Info

Build the Docker image from the latest habana-main branch. Inside the container, run:

text-generation-benchmark -t mistralai/Mixtral-8x7B-Instruct-v0.1

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the benchmark for Mixtral 8x7B on 2x HPUs.

With this PR:

| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 3.75 tokens/secs | 3.23 tokens/secs | 3.87 tokens/secs |
| | 2 | 7.73 tokens/secs | 7.31 tokens/secs | 8.05 tokens/secs |
| | 4 | 15.85 tokens/secs | 15.64 tokens/secs | 16.06 tokens/secs |
| | 8 | 287.04 tokens/secs | 281.86 tokens/secs | 293.99 tokens/secs |
| | 16 | 510.34 tokens/secs | 481.08 tokens/secs | 547.25 tokens/secs |
| | 32 | 880.87 tokens/secs | 857.41 tokens/secs | 903.95 tokens/secs |
| Decode | 1 | 16.76 tokens/secs | 16.64 tokens/secs | 16.84 tokens/secs |
| | 2 | 33.53 tokens/secs | 32.69 tokens/secs | 34.22 tokens/secs |
| | 4 | 68.43 tokens/secs | 68.16 tokens/secs | 68.84 tokens/secs |
| | 8 | 225.42 tokens/secs | 224.82 tokens/secs | 225.82 tokens/secs |
| | 16 | 404.56 tokens/secs | 401.84 tokens/secs | 407.02 tokens/secs |
| | 32 | 653.55 tokens/secs | 651.50 tokens/secs | 655.44 tokens/secs |

Without this PR:

| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 38.37 tokens/secs | 38.29 tokens/secs | 38.57 tokens/secs |
| | 2 | 77.02 tokens/secs | 76.86 tokens/secs | 77.70 tokens/secs |
| | 4 | 154.88 tokens/secs | 154.75 tokens/secs | 155.11 tokens/secs |
| | 8 | 301.49 tokens/secs | 250.01 tokens/secs | 316.98 tokens/secs |
| | 16 | 592.47 tokens/secs | 544.04 tokens/secs | 607.58 tokens/secs |
| | 32 | 970.91 tokens/secs | 740.86 tokens/secs | 1027.14 tokens/secs |
| Decode | 1 | 28.60 tokens/secs | 28.59 tokens/secs | 28.61 tokens/secs |
| | 2 | 57.17 tokens/secs | 57.13 tokens/secs | 57.19 tokens/secs |
| | 4 | 114.30 tokens/secs | 114.26 tokens/secs | 114.33 tokens/secs |
| | 8 | 224.53 tokens/secs | 223.37 tokens/secs | 228.83 tokens/secs |
| | 16 | 399.37 tokens/secs | 398.59 tokens/secs | 402.06 tokens/secs |
| | 32 | 648.88 tokens/secs | 645.33 tokens/secs | 665.85 tokens/secs |

Expected behavior

No regression.

I'm checking it

I've tried to reproduce it with Llama-7B; here are the results.

Without this PR:

| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 279.56 tokens/secs | 216.22 tokens/secs | 311.22 tokens/secs |
| | 2 | 535.36 tokens/secs | 431.27 tokens/secs | 622.92 tokens/secs |
| | 4 | 1072.61 tokens/secs | 803.66 tokens/secs | 1257.65 tokens/secs |
| | 8 | 1994.51 tokens/secs | 1591.15 tokens/secs | 2205.64 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
| Decode | 1 | 46.93 tokens/secs | 46.77 tokens/secs | 47.26 tokens/secs |
| | 2 | 93.94 tokens/secs | 93.28 tokens/secs | 94.54 tokens/secs |
| | 4 | 187.90 tokens/secs | 186.33 tokens/secs | 190.61 tokens/secs |
| | 8 | 374.90 tokens/secs | 373.37 tokens/secs | 377.09 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
With this PR:
| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 23.03 tokens/secs | 18.56 tokens/secs | 25.15 tokens/secs |
| | 2 | 46.39 tokens/secs | 42.86 tokens/secs | 49.54 tokens/secs |
| | 4 | 92.00 tokens/secs | 67.12 tokens/secs | 98.78 tokens/secs |
| | 8 | 2003.16 tokens/secs | 1704.35 tokens/secs | 2235.91 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
| Decode | 1 | 50.72 tokens/secs | 47.04 tokens/secs | 51.00 tokens/secs |
| | 2 | 101.47 tokens/secs | 98.24 tokens/secs | 102.12 tokens/secs |
| | 4 | 202.55 tokens/secs | 187.25 tokens/secs | 204.26 tokens/secs |
| | 8 | 375.23 tokens/secs | 373.78 tokens/secs | 377.80 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |

We can see that prefill performance, especially at small batch sizes, did drop significantly.

  1. Performance is significantly lower for batch sizes < 8. This is caused by a logic flaw in the previous PR's design: it treated batch.batch_size as the actual graph input size, which it is not. In the later recombine logic, the batch size is rounded up to BATCH_BUCKET_SIZE (8 by default), so for test batch sizes < 8 the graphs are identical and there is no need to clear the cache in this case (see the sketch after this list).
  2. Prefill performance is slightly lower even for batch sizes >= 8. To understand this, recall the mechanism of this PR: it removes the previous HPU graph when executing a new graph. A removed graph must be re-captured the next time it is needed, which costs host time. In this test the default benchmark input_len is 10 and output_len is 8, a rather small case, so the fraction of time spent capturing graphs grows and performance drops.
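
To make the bucketing in point 1 concrete, here is a minimal, self-contained sketch of the round-up behaviour. The round_up helper and the BATCH_BUCKET_SIZE default of 8 follow the description above; this is an illustration, not the tgi-gaudi source:

```python
import os

# Batch sizes are padded up to a bucket boundary before an HPU graph is captured;
# BATCH_BUCKET_SIZE defaults to 8, as described above.
BATCH_BUCKET_SIZE = int(os.environ.get("BATCH_BUCKET_SIZE", "8"))

def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple

# Batch sizes 1, 2 and 4 all round up to 8, so they share the same graph shape;
# clearing the graph cache between them only adds re-capture overhead.
for bs in (1, 2, 4, 8, 16, 32):
    print(f"bs={bs:2d} -> graph input size {round_up(bs, BATCH_BUCKET_SIZE)}")
```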

To solve problem 1, I'll change the cache-clear condition to use the rounded batch size, like this:
[screenshot of the proposed cache-clear condition]
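
Since the screenshot is not reproduced here, a rough sketch of that condition follows. should_clear_graph_cache is a name made up for illustration, not an actual tgi-gaudi identifier; the point is only that the comparison uses the bucketed batch size rather than batch.batch_size:

```python
BATCH_BUCKET_SIZE = 8  # default bucket size, per the discussion above

def round_up(value: int, multiple: int) -> int:
    return ((value + multiple - 1) // multiple) * multiple

def should_clear_graph_cache(prev_batch_size: int, new_batch_size: int) -> bool:
    # Compare the bucketed (graph input) sizes, not the raw batch sizes:
    # 1, 2 and 4 all round up to 8, so moving between them reuses the same
    # HPU graph and the cache must NOT be cleared.
    return (round_up(prev_batch_size, BATCH_BUCKET_SIZE)
            != round_up(new_batch_size, BATCH_BUCKET_SIZE))

assert should_clear_graph_cache(2, 4) is False   # same bucket (8)
assert should_clear_graph_cache(8, 16) is True   # bucket changes (8 -> 16)
```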
To solve problem 2, I'll add a condition so that this feature is enabled only when the user chooses the --limit_hpu_graphs option, which does not cache the prefill-stage HPU graph in the first place; this feature removes the prefill HPU graph immediately in the next decode stage anyway.
So the final change will look like this; you can try it out before the fix PR is issued and merged. @mandy-li
[screenshot of the final change]
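
Again as a hypothetical sketch rather than the actual patch, the combined change could look roughly like this (limit_hpu_graphs stands for the state of the --limit_hpu_graphs option; the function name is made up):

```python
BATCH_BUCKET_SIZE = 8

def round_up(value: int, multiple: int) -> int:
    return ((value + multiple - 1) // multiple) * multiple

def should_clear_graph_cache(limit_hpu_graphs: bool,
                             prev_batch_size: int,
                             new_batch_size: int) -> bool:
    # Only drop cached HPU graphs when --limit_hpu_graphs is set: that mode does
    # not keep the prefill graph cached anyway, so evicting it costs nothing,
    # while without the option the re-capture time hurts short benchmark runs.
    if not limit_hpu_graphs:
        return False
    # Even then, only clear when the bucketed (graph input) batch size changes.
    return (round_up(prev_batch_size, BATCH_BUCKET_SIZE)
            != round_up(new_batch_size, BATCH_BUCKET_SIZE))
```

With this gating, a default benchmark run (no --limit_hpu_graphs) should behave as it did before the PR, while memory-constrained runs that pass --limit_hpu_graphs still benefit from the graph eviction.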

Verified that the PR fixes the issue.