triton-lang/triton

Why is fused attention only applicable to Ampere GPUs?

rayleizhu opened this issue · 1 comment

Hi, I'm writing my own operator using fused attention as a template. However, I found that fused attention requires the Ampere architecture:

https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L200
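For reference, my reading of the linked line is that it is a compute-capability guard (Ampere is sm_80 and newer). A quick way to see what such a guard would find on a given machine, assuming a working PyTorch CUDA setup:

```python
import torch

# Report the compute capability that an "Ampere or newer" guard would see.
# Ampere is sm_80/sm_86; Volta (e.g. V100) is sm_70.
major, minor = torch.cuda.get_device_capability()
status = "meets an >= sm_80 requirement" if major >= 8 else "pre-Ampere"
print(f"compute capability: sm_{major}{minor} ({status})")
```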

I do not fully understand this requirement:

  • Does it mean this template uses some arch-specific operators?
  • To use it on a Volta GPU, how should I modify it? (My current fallback is sketched after this list.)
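For now, my workaround on the Volta machine is to dispatch on compute capability and fall back to a plain PyTorch attention when the fused op is unavailable. A minimal sketch, assuming the fused op is callable as `attention(q, k, v, sm_scale)` (the exact signature may differ between Triton versions):

```python
import math
import torch

def reference_attention(q, k, v, sm_scale):
    # Plain PyTorch attention, used as a fallback on pre-Ampere GPUs.
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) * sm_scale
    probs = torch.softmax(scores.float(), dim=-1).to(v.dtype)
    return torch.matmul(probs, v)

def attention_any_gpu(q, k, v):
    sm_scale = 1.0 / math.sqrt(q.shape[-1])
    major, _ = torch.cuda.get_device_capability(q.device)
    if major >= 8:
        # Fused Triton path (import path taken from the linked file;
        # the call signature is my guess, please correct me if it is wrong).
        from triton.ops.flash_attention import attention
        return attention(q, k, v, sm_scale)
    return reference_attention(q, k, v, sm_scale)
```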

Also, it seems that only head_dim=64 is supported, right? How can I adapt it for the head_dim=32 case?

https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L207
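In case it helps to clarify what I'm after: the workaround I'm considering for head_dim=32 is zero-padding the head dimension up to 64 and slicing the output back, since the extra zero columns contribute nothing to the q @ kᵀ scores and the padded output channels stay zero. A sketch, again assuming an `attention(q, k, v, sm_scale)` signature:

```python
import math
import torch
import torch.nn.functional as F
from triton.ops.flash_attention import attention  # path from the linked file; signature may differ

def attention_headdim32(q, k, v):
    # q, k, v: (batch, heads, seq_len, 32). Zero-pad head_dim to the supported 64,
    # run the fused kernel, then slice back to 32. Zero columns leave q @ k^T
    # unchanged and the padded output channels stay zero.
    d = q.shape[-1]
    sm_scale = 1.0 / math.sqrt(d)          # scale from the original head_dim, not 64
    qp, kp, vp = (F.pad(t, (0, 64 - d)) for t in (q, k, v))
    out = attention(qp, kp, vp, sm_scale)  # call signature is my guess; adjust per version
    return out[..., :d]
```

That said, I'd prefer a native head_dim=32 path if the kernel can support it.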

There is some more information in #616.