Feature request: Multi-Head Latent Attention Support
nanowell opened this issue · 6 comments
nanowell commented
MLA is an attention mechanism equipped with low-rank key-value joint compression.
Empirically, MLA achieves superior performance compared with MHA, while significantly reducing the KV cache during inference, thus boosting inference efficiency.
Further details on the MLA architecture design can be found here:
https://arxiv.org/html/2405.04434v5
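
For reference, here is a minimal sketch of the core idea, assuming PyTorch and hypothetical dimension names (`dim`, `heads`, `dim_head`, `kv_latent_dim`). The full DeepSeek-V2 design also compresses the queries and adds a decoupled RoPE branch; both are omitted here for brevity, as is causal masking.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    def __init__(self, dim = 512, heads = 8, dim_head = 64, kv_latent_dim = 128):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.dim_head = dim_head

        self.to_q = nn.Linear(dim, inner_dim, bias = False)

        # down-project hidden states to a shared low-rank KV latent
        # (only this latent needs to be cached during decoding)
        self.to_kv_latent = nn.Linear(dim, kv_latent_dim, bias = False)

        # up-project the cached latent back to per-head keys and values
        self.latent_to_k = nn.Linear(kv_latent_dim, inner_dim, bias = False)
        self.latent_to_v = nn.Linear(kv_latent_dim, inner_dim, bias = False)

        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x, kv_latent_cache = None):
        b, n, _ = x.shape
        h, d = self.heads, self.dim_head

        q = self.to_q(x).view(b, n, h, d).transpose(1, 2)

        # compress the current tokens and append to the (much smaller) latent cache
        latent = self.to_kv_latent(x)
        if kv_latent_cache is not None:
            latent = torch.cat((kv_latent_cache, latent), dim = 1)

        k = self.latent_to_k(latent).view(b, -1, h, d).transpose(1, 2)
        v = self.latent_to_v(latent).view(b, -1, h, d).transpose(1, 2)

        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, h * d)

        # return the latent so the caller can cache it for the next decoding step
        return self.to_out(out), latent
```

With this layout the per-token cache is a single `kv_latent_dim`-sized vector instead of `2 * heads * dim_head` values for full keys and values, which is where the inference-time memory saving comes from.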
lucidrains commented
@nanowell yea, so i think this is just a way to improve inference but doesn't really add anything new