tensorops/TransformerX

[Enhancement] KV Caching for inference speed

soran-ghaderi opened this issue · 1 comment

Is your feature request related to a problem? Please describe.

Cache the key and value matrices computed by the self-attention mechanism during autoregressive decoding, to reduce computational complexity and improve inference speed.

Describe the solution you'd like

This caching mechanism removes the need to recompute the full key and value matrices at every step of the decoding process: each step only computes the key and value for the newly generated token, appends them to the cache, and attends over the cached matrices, leading to faster inference. See the sketch below.
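As a rough illustration only (not tied to TransformerX's current API), a minimal single-head NumPy sketch of one decoding step with a KV cache might look like the following; the function name, cache layout, and weight names are hypothetical:

```python
import numpy as np

def attention_step_with_cache(x_t, W_q, W_k, W_v, cache):
    """One decoding step: compute Q/K/V for the new token only,
    append K/V to the cache, and attend over the cached keys/values."""
    q_t = x_t @ W_q  # (1, d_model)
    k_t = x_t @ W_k  # (1, d_model)
    v_t = x_t @ W_v  # (1, d_model)

    # Append the new key/value row instead of recomputing the full matrices.
    cache["k"] = k_t if cache["k"] is None else np.concatenate([cache["k"], k_t], axis=0)
    cache["v"] = v_t if cache["v"] is None else np.concatenate([cache["v"], v_t], axis=0)

    # Scaled dot-product attention of the new query over all cached keys.
    scores = (q_t @ cache["k"].T) / np.sqrt(q_t.shape[-1])  # (1, t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["v"], cache  # (1, d_model)


# Usage sketch: an autoregressive loop that reuses the cache across steps.
d_model = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
cache = {"k": None, "v": None}
for _ in range(5):
    x_t = rng.standard_normal((1, d_model))  # embedding of the latest token
    out, cache = attention_step_with_cache(x_t, W_q, W_k, W_v, cache)
```

The key design point is that per-step cost becomes proportional to the current sequence length rather than its square, since previously computed keys and values are reused rather than rebuilt.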