FasterDecoding/Medusa

Token-wise the same generation?

Ageliss opened this issue · 2 comments

Does a Medusa-1 model generate token-for-token the same output as the base model without the medusa heads?

I found that changing the medusa choices changes the output.

We've worked around this problem by shrinking the medusa choices to only the top-1 predictions, i.e., [(0,), (0, 0), (0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0, 0)].

This way, the MHCA computation yields bit-wise identical logits to the baseline without medusa decoding.

Hope this helps others interested in bit-wise identical decoding.
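
For anyone who wants to check this behaviour themselves, here is a minimal, self-contained sketch (not the Medusa codebase; `base_model`, `input_ids`, and `draft_chain` are illustrative names) of greedy verification of a single top-1 draft chain. Because each drafted token is accepted only if it equals the base model's own argmax, the accepted sequence is exactly what plain greedy decoding would produce, which is why restricting the choices to top-1 chains keeps the output bit-wise identical.

```python
# Minimal sketch, not the Medusa implementation: greedy verification of a
# single top-1 draft chain. `base_model` is assumed to be a Hugging Face
# style causal LM returning `.logits`; `draft_chain` holds the medusa heads'
# top-1 predictions with shape [1, k].
import torch

@torch.no_grad()
def verify_top1_chain(base_model, input_ids, draft_chain):
    # Score the prompt plus the drafted tokens in one forward pass.
    candidate = torch.cat([input_ids, draft_chain], dim=-1)
    logits = base_model(candidate).logits          # [1, seq_len, vocab]

    prompt_len = input_ids.shape[-1]
    accepted = []
    for i in range(draft_chain.shape[-1]):
        # The logits at position prompt_len - 1 + i predict drafted token i.
        pred = logits[0, prompt_len - 1 + i].argmax().item()
        if pred != draft_chain[0, i].item():
            break                                   # stop at the first mismatch
        accepted.append(pred)
    # Every accepted token equals the base model's greedy choice, so the
    # final sequence matches plain greedy decoding token for token.
    return torch.tensor(accepted, dtype=input_ids.dtype)
```

With a wider candidate tree the accepted tokens are still verified against the base model, but the batched tree-attention forward pass may not be numerically identical to the plain forward pass, which is presumably why changing the medusa choices can flip some greedy token choices.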
