Issues
- 1
`expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min)` RuntimeError: The size of tensor a (3509) must match the size of tensor b (7017) at non-singleton dimension 3
#13 opened by seeyourcell - 0
Can't run LongBench!
#12 opened by HarryWu99 - 0
why only decode do compress?
#11 opened by CSEEduanyu - 1
Only kv is compressed. Is the size of Q and K inconsistent when attention is calculated?
#10 opened by CSEEduanyu - 1
It seems that SnapKV needs to be able to do "prefill" at least once before the prompt can be compressed.
#9 opened by 66RING - 8
Questions on paper and code [prompting for mistral, positional index, minor errors & questions in paper]
#1 opened by MarsJacobs - 1
Grouped query attention implementation
#4 opened by guozhiyu - 1
maybe a bug in `update_kv` function
#3 opened by HarryWu99 - 1