certik/fastGPT

Implement kv cache

certik opened this issue · 0 comments

Here is some information on what the KV cache is: https://kipp.ly/blog/transformer-inference-arithmetic/#kv-cache

Roughly speaking, when a new token is appended to the end of the input and the next token is generated, a lot of the computation from the previous iteration can be reused. We need to cache those results and reuse them. A minimal sketch of the idea is below.
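Concretely, in each attention layer the keys and values for all past positions do not change when a token is appended, so only the K/V for the new token need to be computed and appended to a per-layer cache; the new query then attends over the cached K/V. A minimal NumPy sketch of single-head attention with a cache (not fastGPT or picoGPT code, names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_cache(x_new, w_qkv, w_out, cache):
    """Attention over one new token, reusing cached keys/values.

    x_new : (1, n_embd) embedding of the newly appended token
    cache : dict holding 'k' and 'v' of shape (n_past, n_embd), or empty
    """
    # only the new token's Q, K, V are computed each step
    q, k, v = np.split(x_new @ w_qkv, 3, axis=-1)
    # append the new K/V to the cache instead of recomputing the whole sequence
    cache["k"] = np.vstack([cache["k"], k]) if "k" in cache else k
    cache["v"] = np.vstack([cache["v"], v]) if "v" in cache else v
    # the new token attends to all cached positions; the causal mask is
    # implicit because only past tokens are in the cache
    scores = softmax(q @ cache["k"].T / np.sqrt(q.shape[-1]))  # (1, n_past+1)
    return (scores @ cache["v"]) @ w_out
```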

Here is a reference implementation in picoGPT: jaymody/picoGPT#7 (and the accompanying blog post https://immortal3.github.io/becoming-the-unbeatable/posts/gpt-kvcache/) that should be straightforward to adapt.
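The generation loop changes accordingly: the full prompt is run once to populate the caches, and each subsequent step feeds only the single new token through the model. A hedged sketch under assumed interfaces (the `model(ids, caches)` call and greedy sampling are hypothetical, not the actual picoGPT or fastGPT API):

```python
import numpy as np

def generate(ids, model, n_tokens):
    # hypothetical interface: model(ids, caches) returns logits and updated caches
    caches = [{} for _ in range(model.n_layer)]   # one KV cache per layer
    logits, caches = model(ids, caches)           # prime the caches with the full prompt
    for _ in range(n_tokens):
        next_id = int(np.argmax(logits[-1]))      # greedy sampling for simplicity
        ids.append(next_id)
        logits, caches = model([next_id], caches) # feed only the new token
    return ids
```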