How to benchmark for speedup and acceptance rate?
singularity-s0 opened this issue · 7 comments
Sorry for asking a possibly obvious question but it would be better if the documentation makes this clear.
+1 How to benchmark the speed up? I ran the example codes and didn't see obvious acceleration. How to reproduce 4.04x accelerate of Llama2-7b on A100?
To run Sequoia:
CUDA_VISIBLE_DEVICES=0 python testbed_greedy.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 384 --growmap ../A100_growmaps/68m_7b/growmaps/A100-C4-68m-7b-greedy.pt --Mode greedy --dataset c4
To run baseline:
CUDA_VISIBLE_DEVICES=0 python testbed_greedy.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 384 --growmap ../A100_growmaps/68m_7b/growmaps/A100-C4-68m-7b-greedy.pt --Mode baseline --dataset c4
As the framework is written on top of Hugging Face, the baseline should be around 23ms ~ 25ms per token, and Sequoia should be around 6ms ~ 7ms per token.
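If you want to double-check the baseline number outside of the testbed script, a rough timing of plain Hugging Face greedy generation looks something like the sketch below. This is only a sanity check under assumptions of my own (the prompt, token count, and dtype are arbitrary choices), not what testbed_greedy.py actually does:

```python
# Rough per-token latency check with plain Hugging Face generation.
# Sanity check only, not the repo's benchmarking code.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda:0"
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")
new_tokens = 128

# Warm-up so kernel compilation / caching doesn't skew the measurement.
model.generate(**inputs, max_new_tokens=8, do_sample=False)

torch.cuda.synchronize()
start = time.time()
model.generate(**inputs, max_new_tokens=new_tokens,
               min_new_tokens=new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

print(f"~{elapsed / new_tokens * 1000:.1f} ms per token")
```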
Thanks for the response. How about the acceptance rate? What do decoding step and large model step mean in the output?
decoding step means how many tokens are generated. large model step means how many times the large model does verification. decoding step / large model step reflects how many tokens are correctly predicted with Sequoia's tree.
acceptance rate needs to be independently measured with:
python test_accept.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf --T 0.6 --P 1.0 --start 0 --end 200 --M 288 --W 32 --ALG stochastic --dataset cnn
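For reference, the standard stochastic acceptance rule from speculative sampling accepts a draft token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution. A minimal sketch of measuring that rate is below; it is my own illustration of the general technique, not the actual logic in test_accept.py:

```python
# Sketch of the standard stochastic speculative-sampling acceptance rule.
# Names and tensor shapes are illustrative, not Sequoia's actual code.
import torch

def acceptance_rate(p_target: torch.Tensor, q_draft: torch.Tensor,
                    draft_tokens: torch.Tensor) -> float:
    """p_target, q_draft: [n, vocab] probabilities at each drafted position;
    draft_tokens: [n] tokens sampled from the draft model."""
    idx = torch.arange(draft_tokens.shape[0])
    p = p_target[idx, draft_tokens]
    q = q_draft[idx, draft_tokens]
    accept_prob = torch.clamp(p / q, max=1.0)   # min(1, p/q)
    accepted = torch.rand_like(accept_prob) < accept_prob
    return accepted.float().mean().item()
```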
Thank you. This answers all my questions.
After testing both baseline and greedy on the C4 dataset on an A100, I get the following results:
Baseline: total time :110.10318s, latency :0.02298s, decoding step: 4791
Greedy: total time :144.56247s, latency :0.00813s, decoding step: 17778, large model step: 4605, 3.8605863192182412
It seems that more tokens are generated in greedy mode than in baseline mode. Although the generation latency is as expected, I wonder if it is unfair to compare latency when the two runs generate different numbers of tokens. Would it be better to set a fixed sequence length and compare total generation time instead?
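For reference, the reported latency appears to be total time divided by decoding steps, so the per-token numbers can be derived directly from the logs above (a quick back-of-the-envelope check on this particular run, not an official figure):

```python
# Back-of-the-envelope numbers from the two runs quoted above.
baseline_latency = 110.10318 / 4791        # ~0.02298 s per token
sequoia_latency  = 144.56247 / 17778       # ~0.00813 s per token
print(baseline_latency / sequoia_latency)  # ~2.83x per-token speedup in this run
```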
decoding step / large model step reflects how many tokens are correctly predicted with Sequoia's tree.
Just to make sure I understand this correctly: if all drafts are wrong, then decoding step / large model step = 1. And if decoding step / large model step = 2, it means that on average the drafting model gets 1 token correct per draft. Is this right?
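With the numbers from the greedy run above, that reading gives the following (assuming each verification contributes one token on top of the accepted draft tokens):

```python
# Applying that reading to the greedy run quoted above.
decoding_steps = 17778
large_model_steps = 4605
tokens_per_verification = decoding_steps / large_model_steps  # ~3.86
accepted_per_verification = tokens_per_verification - 1       # ~2.86 draft tokens
print(tokens_per_verification, accepted_per_verification)
```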
Your understanding is correct. We only allow the baseline to generate 32 tokens because in some experiments, such as Vicuna33B, running the baseline can cost a lot of time.
You can change this manually if you want. What you need to modify is the inner_decoding_step < 32 condition in testbed.py.
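For context, the loop presumably looks something like the sketch below; apart from the inner_decoding_step < 32 condition mentioned above, the names and structure here are assumptions rather than the actual testbed.py code:

```python
# Illustrative only: apart from `inner_decoding_step < 32`, everything here
# is a hypothetical stand-in, not the actual testbed.py implementation.
def decode_one_step() -> int:
    """Hypothetical stand-in for one verification/decoding step."""
    return 1

inner_decoding_step = 0
generated_tokens = 0
while inner_decoding_step < 32:   # raise 32 to let the baseline generate longer
    generated_tokens += decode_one_step()
    inner_decoding_step += 1
```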
Also, we plan to update the code in the coming weeks. We will address this problem then.