guozhiyu opened this issue 2 years ago · 0 comments
I noticed that the T5 implementations in both t5x and Flaxformer don't use a KV cache for cross-attention. Is recomputing the cross-attention keys and values at each decode step actually faster than caching them?
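For context, here is a minimal sketch (not t5x/Flaxformer code) of what caching the cross-attention K/V would mean: since the encoder outputs are fixed during decoding, K and V could be projected once before the decode loop and reused at every step. All names, shapes, and projection matrices below are hypothetical.

```python
# Minimal single-head cross-attention sketch with a precomputed K/V cache.
# This is an illustration only, not the t5x or Flaxformer implementation.
import jax
import jax.numpy as jnp

d_model, src_len = 64, 16
key = jax.random.PRNGKey(0)
k_enc, k_q, k_k, k_v, k_tok = jax.random.split(key, 5)

encoder_out = jax.random.normal(k_enc, (src_len, d_model))  # fixed during decoding
Wq = jax.random.normal(k_q, (d_model, d_model)) / jnp.sqrt(d_model)
Wk = jax.random.normal(k_k, (d_model, d_model)) / jnp.sqrt(d_model)
Wv = jax.random.normal(k_v, (d_model, d_model)) / jnp.sqrt(d_model)

def cross_attention(query_vec, k_cache, v_cache):
    # Scaled dot-product attention of one decoder position over the encoder outputs.
    q = query_vec @ Wq                          # (d_model,)
    scores = (k_cache @ q) / jnp.sqrt(d_model)  # (src_len,)
    weights = jax.nn.softmax(scores)
    return weights @ v_cache                    # (d_model,)

# "Caching" here means projecting K and V from the encoder outputs once,
# before the decode loop, instead of recomputing encoder_out @ Wk / Wv every step.
k_cache = encoder_out @ Wk
v_cache = encoder_out @ Wv

decoder_state = jax.random.normal(k_tok, (d_model,))  # stand-in for one decode step
context = cross_attention(decoder_state, k_cache, v_cache)
print(context.shape)  # (64,)
```

My question is whether skipping this cache and redoing the K/V projections each step, as the current implementations appear to do, ends up being faster in practice.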