lucidrains/x-transformers

Feature request: Multi-Head Latent Attention Support

nanowell opened this issue · 6 comments


MLA is an attention mechanism equipped with low-rank joint compression of keys and values. Empirically, it achieves superior performance compared with MHA while significantly reducing the KV cache during inference, boosting inference efficiency.
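
A minimal sketch of the joint KV compression idea in plain PyTorch, for context — the names and dimensions here are illustrative (not the paper's), and the decoupled RoPE branch from the paper is omitted. The point is that only a small shared latent is cached at inference time, and per-head keys/values are re-expanded from it:

```python
# sketch of MLA-style low-rank KV joint compression (illustrative, not the paper's exact design)
import torch
import torch.nn.functional as F
from torch import nn

class LatentAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, dim_latent = 128):
        super().__init__()
        self.heads = heads
        self.dim_head = dim_head
        inner_dim = heads * dim_head

        self.to_q = nn.Linear(dim, inner_dim, bias = False)

        # joint down-projection: this small latent is all that gets cached,
        # replacing the usual per-head K and V caches
        self.to_kv_latent = nn.Linear(dim, dim_latent, bias = False)

        # up-projections re-expand the latent into per-head keys and values
        self.latent_to_k = nn.Linear(dim_latent, inner_dim, bias = False)
        self.latent_to_v = nn.Linear(dim_latent, inner_dim, bias = False)

        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x, cached_latents = None):
        # x: (batch, seq, dim); cached_latents: (batch, past_seq, dim_latent)
        b, n, _ = x.shape

        latents = self.to_kv_latent(x)
        if cached_latents is not None:
            latents = torch.cat((cached_latents, latents), dim = 1)

        q = self.to_q(x)
        k = self.latent_to_k(latents)
        v = self.latent_to_v(latents)

        # split out heads -> (batch, heads, seq, dim_head)
        q, k, v = (t.view(b, -1, self.heads, self.dim_head).transpose(1, 2) for t in (q, k, v))

        # causal mask only on the prefill pass; single-token decode steps
        # attend to everything already in the cache
        out = F.scaled_dot_product_attention(q, k, v, is_causal = cached_latents is None)
        out = out.transpose(1, 2).reshape(b, n, -1)

        # return latents so the caller can cache them for the next step
        return self.to_out(out), latents
```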

Further details on the MLA architecture design can be found in the DeepSeek-V2 paper:
https://arxiv.org/html/2405.04434v5

@nanowell yea, so i think this is just a way to improve inference but doesn't really add anything new
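
For a rough sense of the inference-time savings being discussed, a back-of-envelope comparison of cache sizes, using illustrative numbers rather than DeepSeek-V2's actual configuration:

```python
# per-token cache entries per layer: MHA stores full K and V across all heads,
# MLA stores only the joint latent (sizes here are illustrative, not the paper's)
heads, dim_head, dim_latent = 32, 128, 512

mha_cache_per_token = 2 * heads * dim_head   # 8192 values (K + V)
mla_cache_per_token = dim_latent             # 512 values

print(mha_cache_per_token / mla_cache_per_token)  # -> 16.0x smaller KV cache
```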