huggingface/tgi-gaudi

https://github.com/huggingface/tgi-gaudi/pull/176 causes performance regression for benchmark

mandy-li opened this issue · 3 comments

System Info

Build the Docker image from the latest habana-main branch. Inside the container, run:

text-generation-benchmark -t mistralai/Mixtral-8x7B-Instruct-v0.1

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the benchmark for Mixtral 8x7B on 2x HPUs.

With this PR:

| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 3.75 tokens/secs | 3.23 tokens/secs | 3.87 tokens/secs |
| | 2 | 7.73 tokens/secs | 7.31 tokens/secs | 8.05 tokens/secs |
| | 4 | 15.85 tokens/secs | 15.64 tokens/secs | 16.06 tokens/secs |
| | 8 | 287.04 tokens/secs | 281.86 tokens/secs | 293.99 tokens/secs |
| | 16 | 510.34 tokens/secs | 481.08 tokens/secs | 547.25 tokens/secs |
| | 32 | 880.87 tokens/secs | 857.41 tokens/secs | 903.95 tokens/secs |
| Decode | 1 | 16.76 tokens/secs | 16.64 tokens/secs | 16.84 tokens/secs |
| | 2 | 33.53 tokens/secs | 32.69 tokens/secs | 34.22 tokens/secs |
| | 4 | 68.43 tokens/secs | 68.16 tokens/secs | 68.84 tokens/secs |
| | 8 | 225.42 tokens/secs | 224.82 tokens/secs | 225.82 tokens/secs |
| | 16 | 404.56 tokens/secs | 401.84 tokens/secs | 407.02 tokens/secs |
| | 32 | 653.55 tokens/secs | 651.50 tokens/secs | 655.44 tokens/secs |

Without this PR:

| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 38.37 tokens/secs | 38.29 tokens/secs | 38.57 tokens/secs |
| | 2 | 77.02 tokens/secs | 76.86 tokens/secs | 77.70 tokens/secs |
| | 4 | 154.88 tokens/secs | 154.75 tokens/secs | 155.11 tokens/secs |
| | 8 | 301.49 tokens/secs | 250.01 tokens/secs | 316.98 tokens/secs |
| | 16 | 592.47 tokens/secs | 544.04 tokens/secs | 607.58 tokens/secs |
| | 32 | 970.91 tokens/secs | 740.86 tokens/secs | 1027.14 tokens/secs |
| Decode | 1 | 28.60 tokens/secs | 28.59 tokens/secs | 28.61 tokens/secs |
| | 2 | 57.17 tokens/secs | 57.13 tokens/secs | 57.19 tokens/secs |
| | 4 | 114.30 tokens/secs | 114.26 tokens/secs | 114.33 tokens/secs |
| | 8 | 224.53 tokens/secs | 223.37 tokens/secs | 228.83 tokens/secs |
| | 16 | 399.37 tokens/secs | 398.59 tokens/secs | 402.06 tokens/secs |
| | 32 | 648.88 tokens/secs | 645.33 tokens/secs | 665.85 tokens/secs |

Expected behavior

No regression.

I'm checking it

I've tried to reproduce it with Llama-7B; here are the results.

Without this PR:

| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 279.56 tokens/secs | 216.22 tokens/secs | 311.22 tokens/secs |
| | 2 | 535.36 tokens/secs | 431.27 tokens/secs | 622.92 tokens/secs |
| | 4 | 1072.61 tokens/secs | 803.66 tokens/secs | 1257.65 tokens/secs |
| | 8 | 1994.51 tokens/secs | 1591.15 tokens/secs | 2205.64 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
| Decode | 1 | 46.93 tokens/secs | 46.77 tokens/secs | 47.26 tokens/secs |
| | 2 | 93.94 tokens/secs | 93.28 tokens/secs | 94.54 tokens/secs |
| | 4 | 187.90 tokens/secs | 186.33 tokens/secs | 190.61 tokens/secs |
| | 8 | 374.90 tokens/secs | 373.37 tokens/secs | 377.09 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
With this PR:
| Step | Batch Size | Average | Lowest | Highest |
|---|---|---|---|---|
| Prefill | 1 | 23.03 tokens/secs | 18.56 tokens/secs | 25.15 tokens/secs |
| | 2 | 46.39 tokens/secs | 42.86 tokens/secs | 49.54 tokens/secs |
| | 4 | 92.00 tokens/secs | 67.12 tokens/secs | 98.78 tokens/secs |
| | 8 | 2003.16 tokens/secs | 1704.35 tokens/secs | 2235.91 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |
| Decode | 1 | 50.72 tokens/secs | 47.04 tokens/secs | 51.00 tokens/secs |
| | 2 | 101.47 tokens/secs | 98.24 tokens/secs | 102.12 tokens/secs |
| | 4 | 202.55 tokens/secs | 187.25 tokens/secs | 204.26 tokens/secs |
| | 8 | 375.23 tokens/secs | 373.78 tokens/secs | 377.80 tokens/secs |
| | 16 | NaN tokens/secs | NaN tokens/secs | NaN tokens/secs |
| | 32 | | | |

We can see that prefill performance, especially at small batch sizes, did drop significantly.

  1. Performance is significantly lower for batch sizes < 8. This is caused by a logic flaw in the previous PR's design: it treated batch.batch_size as the actual graph input size, which it is not. In the later recombine logic, the batch size is rounded up to BATCH_BUCKET_SIZE (8 by default), so for test batch sizes < 8 the graphs are identical and there is no need to clear the cache in this case (see the sketch after this list).
  2. Prefill performance is slightly lower even for batch sizes >= 8. To understand this, recall the mechanism of this PR: it removes the previous HPU graph when executing a new graph. A removed graph must be re-captured the next time it is needed, which costs host time. In this test the default benchmark input_len is 10 and output_len is 8, a rather small case, so the fraction of time spent capturing graphs grows and performance drops.
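
To make the bucketing in point 1 concrete, here is a minimal, self-contained sketch of the round-up behaviour. The round_up helper and the BATCH_BUCKET_SIZE default of 8 follow the description above; this is an illustration, not the tgi-gaudi source:

```python
import os

# Batch sizes are padded up to a bucket boundary before an HPU graph is captured;
# BATCH_BUCKET_SIZE defaults to 8, as described above.
BATCH_BUCKET_SIZE = int(os.environ.get("BATCH_BUCKET_SIZE", "8"))

def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple

# Batch sizes 1, 2 and 4 all round up to 8, so they share the same graph shape;
# clearing the graph cache between them only adds re-capture overhead.
for bs in (1, 2, 4, 8, 16, 32):
    print(f"bs={bs:2d} -> graph input size {round_up(bs, BATCH_BUCKET_SIZE)}")
```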

To solve problem 1, I'll change the cache-clear condition to use the rounded batch size, like this:
[screenshot of the proposed cache-clear condition]
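
Since the screenshot is not reproduced here, a rough sketch of that condition follows. should_clear_graph_cache is a name made up for illustration, not an actual tgi-gaudi identifier; the point is only that the comparison uses the bucketed batch size rather than batch.batch_size:

```python
BATCH_BUCKET_SIZE = 8  # default bucket size, per the discussion above

def round_up(value: int, multiple: int) -> int:
    return ((value + multiple - 1) // multiple) * multiple

def should_clear_graph_cache(prev_batch_size: int, new_batch_size: int) -> bool:
    # Compare the bucketed (graph input) sizes, not the raw batch sizes:
    # 1, 2 and 4 all round up to 8, so moving between them reuses the same
    # HPU graph and the cache must NOT be cleared.
    return (round_up(prev_batch_size, BATCH_BUCKET_SIZE)
            != round_up(new_batch_size, BATCH_BUCKET_SIZE))

assert should_clear_graph_cache(2, 4) is False   # same bucket (8)
assert should_clear_graph_cache(8, 16) is True   # bucket changes (8 -> 16)
```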
To solve problem 2, I'll add a condition so that this feature is enabled only when the user chooses the --limit_hpu_graphs option, which does not cache the prefill-stage HPU graph in the first place; this feature removes the prefill HPU graph immediately in the next decode stage anyway.
So the final change will look like this; you can try it out before the fix PR is issued and merged. @mandy-li
[screenshot of the final change]
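
Again as a hypothetical sketch rather than the actual patch, the combined change could look roughly like this (limit_hpu_graphs stands for the state of the --limit_hpu_graphs option; the function name is made up):

```python
BATCH_BUCKET_SIZE = 8

def round_up(value: int, multiple: int) -> int:
    return ((value + multiple - 1) // multiple) * multiple

def should_clear_graph_cache(limit_hpu_graphs: bool,
                             prev_batch_size: int,
                             new_batch_size: int) -> bool:
    # Only drop cached HPU graphs when --limit_hpu_graphs is set: that mode does
    # not keep the prefill graph cached anyway, so evicting it costs nothing,
    # while without the option the re-capture time hurts short benchmark runs.
    if not limit_hpu_graphs:
        return False
    # Even then, only clear when the bucketed (graph input) batch size changes.
    return (round_up(prev_batch_size, BATCH_BUCKET_SIZE)
            != round_up(new_batch_size, BATCH_BUCKET_SIZE))
```

With this gating, a default benchmark run (no --limit_hpu_graphs) should behave as it did before the PR, while memory-constrained runs that pass --limit_hpu_graphs still benefit from the graph eviction.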

Verified that the PR fixes the issue.