[Question] How does lightllm implement nopad batching?
Tomorrowdawn commented
Thanks for your great work! Here are my concerns:
Say we get a batch of inputs with lengths L1, L2, ... How do you simultaneously compute the attention scores of these inputs with 'nopad'? That sounds amazing, but I failed to figure out how when reading the source code.
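To make my question concrete, here is a pure-Python sketch of what I imagine 'nopad' prefill attention could look like: all sequences flattened into one token buffer, with cumulative offsets marking sequence boundaries. The names (`cu_seqlens`, etc.) are my assumptions, not lightllm's actual code:

```python
import math

def nopad_attention(q, k, v, cu_seqlens):
    """Causal attention over a flattened batch (no padding).

    q, k, v: per-token vectors for ALL sequences concatenated, so a
    batch with lengths L1, L2, ... gives L1 + L2 + ... rows total.
    cu_seqlens: cumulative starts, e.g. [0, L1, L1 + L2, ...]; these
    offsets are my guess at how boundaries are tracked."""
    dim = len(q[0])
    out = []
    for b in range(len(cu_seqlens) - 1):
        start, end = cu_seqlens[b], cu_seqlens[b + 1]
        for i in range(start, end):
            # causal mask: token i attends to tokens start..i of its own sequence
            scores = [sum(q[i][d] * k[j][d] for d in range(dim)) / math.sqrt(dim)
                      for j in range(start, i + 1)]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            # weighted sum of values within the same sequence only
            out.append([sum(e / z * v[start + j][d] for j, e in enumerate(exps))
                        for d in range(dim)])
    return out
```

Is the actual Triton kernel doing something equivalent to this, just with the per-sequence offsets passed in as tensors?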
Additionally, in the decoding phase, how do you handle different KV lengths? (The code suggests the KV cache has a well-formed shape [B, num_heads, ...], which is confusing, because different prefixes result in different KV cache lengths.)
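My current guess is a token-level pool where each request keeps its own list of slot indices, so requests with different prefix lengths can share one flat buffer. Again, names like `TokenKVCache` and `req_slots` are hypothetical, not lightllm's API:

```python
import math

class TokenKVCache:
    """Flat token-level KV pool; each request owns a list of slot
    indices, so KV lengths can differ per request without padding.
    (A sketch of my mental model, not lightllm's actual code.)"""
    def __init__(self, max_tokens, dim):
        self.k = [[0.0] * dim for _ in range(max_tokens)]
        self.v = [[0.0] * dim for _ in range(max_tokens)]
        self.free = list(range(max_tokens))  # free slot indices
        self.req_slots = {}                  # request id -> slot indices

    def append(self, req_id, k_vec, v_vec):
        # grab any free slot; slots of one request need not be contiguous
        slot = self.free.pop()
        self.k[slot] = k_vec
        self.v[slot] = v_vec
        self.req_slots.setdefault(req_id, []).append(slot)

def decode_attention(cache, req_id, q):
    """One decoding step for one request: attend over that request's
    own KV slots, however many it currently has."""
    slots = cache.req_slots[req_id]
    dim = len(q)
    scores = [sum(q[d] * cache.k[s][d] for d in range(dim)) / math.sqrt(dim)
              for s in slots]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [sum(e / z * cache.v[s][d] for e, s in zip(exps, slots))
            for d in range(dim)]
```

If that is roughly right, then the [B, num_heads, ...] tensors I saw would just be per-step views gathered from the pool, not the cache's storage layout. Could you confirm?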
I want to implement batched speculative decoding, so these details matter to me.
Thanks. Any detail, code, or pseudocode is appreciated.