Feature request: Multi-Head Latent Attention Support
nanowell opened this issue · 6 comments
nanowell commented
MLA is an attention mechanism equipped with low-rank key-value joint compression.
Empirically, MLA achieves superior performance compared with MHA, while significantly reducing the KV cache during inference, thus boosting inference efficiency.
Further details on the MLA architecture design can be found here:
https://arxiv.org/html/2405.04434v5
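
For reference, here is a minimal sketch of the core idea, assuming PyTorch and hypothetical dimension names (`dim`, `heads`, `dim_head`, `kv_latent_dim`). The full DeepSeek-V2 design also compresses the queries and adds a decoupled RoPE branch; both are omitted here for brevity, as is causal masking.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    def __init__(self, dim = 512, heads = 8, dim_head = 64, kv_latent_dim = 128):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.dim_head = dim_head

        self.to_q = nn.Linear(dim, inner_dim, bias = False)

        # down-project hidden states to a shared low-rank KV latent
        # (only this latent needs to be cached during decoding)
        self.to_kv_latent = nn.Linear(dim, kv_latent_dim, bias = False)

        # up-project the cached latent back to per-head keys and values
        self.latent_to_k = nn.Linear(kv_latent_dim, inner_dim, bias = False)
        self.latent_to_v = nn.Linear(kv_latent_dim, inner_dim, bias = False)

        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x, kv_latent_cache = None):
        b, n, _ = x.shape
        h, d = self.heads, self.dim_head

        q = self.to_q(x).view(b, n, h, d).transpose(1, 2)

        # compress the current tokens and append to the (much smaller) latent cache
        latent = self.to_kv_latent(x)
        if kv_latent_cache is not None:
            latent = torch.cat((kv_latent_cache, latent), dim = 1)

        k = self.latent_to_k(latent).view(b, -1, h, d).transpose(1, 2)
        v = self.latent_to_v(latent).view(b, -1, h, d).transpose(1, 2)

        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, h * d)

        # return the latent so the caller can cache it for the next decoding step
        return self.to_out(out), latent
```

With this layout the per-token cache is a single `kv_latent_dim`-sized vector instead of `2 * heads * dim_head` values for full keys and values, which is where the inference-time memory saving comes from.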
lucidrains commented
@nanowell yea, so i think this is just a way to improve inference but doesn't really add anything new