google-research/t5x

KV cache in cross-attention of T5 model

guozhiyu opened this issue · 0 comments

I noticed that the T5 implementations in both t5x and Flaxformer do not use a KV cache for cross-attention during autoregressive decoding, i.e. the encoder key/value projections are recomputed at every decode step instead of being computed once and reused. Is recomputing them each step actually faster than caching them?
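To make the question concrete, here is a minimal sketch (not the actual t5x/Flaxformer code) of the two options I mean. The names `wk`, `wv`, and `step_fn` are hypothetical placeholders: `wk`/`wv` stand for the cross-attention key/value projection weights, and `step_fn` stands for whatever consumes the keys/values at a single decode step.

```python
import jax.numpy as jnp

def project_kv(enc, wk, wv):
    # Project encoder outputs [batch, enc_len, d_model] into keys/values
    # [batch, enc_len, d_kv]; `wk`/`wv` are hypothetical weights of shape
    # [d_model, d_kv].
    k = jnp.einsum('bld,dh->blh', enc, wk)
    v = jnp.einsum('bld,dh->blh', enc, wv)
    return k, v

def decode_without_cross_kv_cache(enc, wk, wv, num_steps, step_fn):
    # Option A (what t5x/Flaxformer appear to do): re-project the encoder
    # outputs at every decode step, even though they never change.
    out = None
    for _ in range(num_steps):
        k, v = project_kv(enc, wk, wv)  # recomputed each step
        out = step_fn(k, v)
    return out

def decode_with_cross_kv_cache(enc, wk, wv, num_steps, step_fn):
    # Option B (what I mean by a cross-attention KV cache): project once
    # before the loop and reuse the cached tensors; the result is identical
    # because the encoder outputs are fixed during decoding.
    k, v = project_kv(enc, wk, wv)  # computed once
    out = None
    for _ in range(num_steps):
        out = step_fn(k, v)
    return out
```

My expectation is that option B trades a small amount of memory for skipping the repeated projections, so I am curious whether there is a measured reason (e.g. interaction with scan/jit or negligible cost of the projections) for not caching.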