FasterDecoding/Medusa

Token-wise the same generation?

Ageliss opened this issue · 2 comments

Does a Medusa-1 model generate token-for-token the same output as the base model without the medusa heads?

I found that changing the medusa choices changes the output.

We've worked around this problem by shrinking the medusa choices to only the top-1 predictions, i.e., [(0,), (0, 0), (0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 0, 0)].

This way, the MHCA computation yields bit-wise identical logits to the baseline without medusa decoding.

Hope this helps others interested in bit-wise identical decoding.
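
For anyone who wants to check this behaviour themselves, here is a minimal, self-contained sketch (not the Medusa codebase; `base_model`, `input_ids`, and `draft_chain` are illustrative names) of greedy verification of a single top-1 draft chain. Because each drafted token is accepted only if it equals the base model's own argmax, the accepted sequence is exactly what plain greedy decoding would produce, which is why restricting the choices to top-1 chains keeps the output bit-wise identical.

```python
# Minimal sketch, not the Medusa implementation: greedy verification of a
# single top-1 draft chain. `base_model` is assumed to be a Hugging Face
# style causal LM returning `.logits`; `draft_chain` holds the medusa heads'
# top-1 predictions with shape [1, k].
import torch

@torch.no_grad()
def verify_top1_chain(base_model, input_ids, draft_chain):
    # Score the prompt plus the drafted tokens in one forward pass.
    candidate = torch.cat([input_ids, draft_chain], dim=-1)
    logits = base_model(candidate).logits          # [1, seq_len, vocab]

    prompt_len = input_ids.shape[-1]
    accepted = []
    for i in range(draft_chain.shape[-1]):
        # The logits at position prompt_len - 1 + i predict drafted token i.
        pred = logits[0, prompt_len - 1 + i].argmax().item()
        if pred != draft_chain[0, i].item():
            break                                   # stop at the first mismatch
        accepted.append(pred)
    # Every accepted token equals the base model's greedy choice, so the
    # final sequence matches plain greedy decoding token for token.
    return torch.tensor(accepted, dtype=input_ids.dtype)
```

With a wider candidate tree the accepted tokens are still verified against the base model, but the batched tree-attention forward pass may not be numerically identical to the plain forward pass, which is presumably why changing the medusa choices can flip some greedy token choices.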
