guozhiyu opened this issue 2 years ago · 0 comments
I noticed that the T5 implementations in both t5x and Flaxformer don't use a KV cache for cross-attention. Is recomputing the cross-attention keys and values at each decode step actually faster than caching them?
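For context, here is a minimal sketch (not t5x/Flaxformer code) of what caching the cross-attention K/V would mean: since the encoder outputs are fixed during decoding, K and V could be projected once before the decode loop and reused at every step. All names, shapes, and projection matrices below are hypothetical.

```python
# Minimal single-head cross-attention sketch with a precomputed K/V cache.
# This is an illustration only, not the t5x or Flaxformer implementation.
import jax
import jax.numpy as jnp

d_model, src_len = 64, 16
key = jax.random.PRNGKey(0)
k_enc, k_q, k_k, k_v, k_tok = jax.random.split(key, 5)

encoder_out = jax.random.normal(k_enc, (src_len, d_model))  # fixed during decoding
Wq = jax.random.normal(k_q, (d_model, d_model)) / jnp.sqrt(d_model)
Wk = jax.random.normal(k_k, (d_model, d_model)) / jnp.sqrt(d_model)
Wv = jax.random.normal(k_v, (d_model, d_model)) / jnp.sqrt(d_model)

def cross_attention(query_vec, k_cache, v_cache):
    # Scaled dot-product attention of one decoder position over the encoder outputs.
    q = query_vec @ Wq                          # (d_model,)
    scores = (k_cache @ q) / jnp.sqrt(d_model)  # (src_len,)
    weights = jax.nn.softmax(scores)
    return weights @ v_cache                    # (d_model,)

# "Caching" here means projecting K and V from the encoder outputs once,
# before the decode loop, instead of recomputing encoder_out @ Wk / Wv every step.
k_cache = encoder_out @ Wk
v_cache = encoder_out @ Wv

decoder_state = jax.random.normal(k_tok, (d_model,))  # stand-in for one decode step
context = cross_attention(decoder_state, k_cache, v_cache)
print(context.shape)  # (64,)
```

My question is whether skipping this cache and redoing the K/V projections each step, as the current implementations appear to do, ends up being faster in practice.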