pkunlp-icler/FastV

Compatibility with KV-Cache

LukeForeverYoung opened this issue · 1 comments

It appears that when `use_cache=True` is set, the visual tokens are pruned once and then remain unchanged at every subsequent inference step. Conversely, when `use_cache=False`, the visual tokens undergo dynamic pruning at each step.
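To make the difference concrete, here is a minimal toy sketch (not FastV's actual code; the names, shapes, and top-k scoring rule are illustrative assumptions) of static selection at prefill versus per-step re-ranking:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 8 "image token" keys, keep the top-4 by
# dot-product attention score from a single query vector.
n_image, d, keep = 8, 16, 4
image_keys = rng.normal(size=(n_image, d))

def topk_indices(query, keys, k):
    scores = keys @ query                      # attention logits for each image token
    return set(np.argsort(scores)[-k:].tolist())

q_prefill = rng.normal(size=d)   # query at the prefill step
q_decode = rng.normal(size=d)    # query at a later decoding step

# use_cache=True behaviour: the subset chosen at prefill is frozen and reused.
static_kept = topk_indices(q_prefill, image_keys, keep)

# use_cache=False behaviour: each decoding step re-ranks all image tokens.
dynamic_kept = topk_indices(q_decode, image_keys, keep)

print("static :", sorted(static_kept))
print("dynamic:", sorted(dynamic_kept))
```

Because the decode-time query generally differs from the prefill query, the two selections can disagree, which is why the two modes can yield different downstream performance.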

I'm uncertain whether enabling the KV cache might compromise the performance of the language model. Do you have any insights on this matter?

Hi, I think it might depend on the particular downstream task. In my opinion, the current KV-cache FastV implementation can perform differently from the non-KV-cache version, although I haven't tested this officially. According to the lmms-eval results (with KV cache enabled), performance on many tasks still keeps up with the full-image-token version. Different tasks may simply have different sensitivity to the current FastV implementation with KV cache.

On another note, we could make FastV with KV cache behave the same as the non-KV-cache version by dynamically pruning the KV cache for image tokens, but we would need to store the full image KV cache first. That might increase the latency of the first forward pass, so the trade-off may not be optimal for LMMs with a small number of visual tokens. I think it is very promising for future/larger LMMs with many more image/video tokens.
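The idea above could be sketched roughly as follows. This is an assumption-laden illustration, not FastV's implementation: a hypothetical cache class keeps the full image-token KV from prefill and re-selects a top-k subset for every decoding step, trading cache memory for parity with the non-KV-cache pruning behaviour:

```python
import numpy as np

rng = np.random.default_rng(1)
n_image, d, keep = 8, 16, 4

# Hypothetical cache: store ALL image-token KV pairs at prefill,
# then pick a fresh top-k subset for each decoding step.
class FullImageKVCache:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values   # full, unpruned image KV

    def select(self, query, k):
        # Re-rank image tokens against the current query each step,
        # mimicking the dynamic pruning of the non-KV-cache path.
        idx = np.argsort(self.keys @ query)[-k:]
        return self.keys[idx], self.values[idx]

cache = FullImageKVCache(rng.normal(size=(n_image, d)),
                         rng.normal(size=(n_image, d)))
kept_k, kept_v = cache.select(rng.normal(size=d), keep)
print(kept_k.shape, kept_v.shape)  # pruned view each step; full cache retained
```

The memory and prefill cost of storing all image KV is exactly the trade-off mentioned above: negligible for a few hundred visual tokens, but significant for video-scale inputs.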