Why is fused attention only applicable on Ampere GPUs?
rayleizhu opened this issue · 1 comment
rayleizhu commented
Hi, I'm writing my own operator using the fused attention kernel as a template. However, I found that fused attention requires an Ampere architecture:
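The check I'm referring to is a compute-capability guard along these lines (a rough sketch from memory, not the exact code in the template):

```python
import torch

# Ampere GPUs report compute capability 8.x; Volta reports 7.0.
capability = torch.cuda.get_device_capability()
if capability[0] < 8:
    raise RuntimeError(
        "fused attention requires compute capability >= 8.0 (Ampere)"
    )
```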
I do not understand this.
- Does this mean the template relies on arch-specific instructions or operators?
- To use it on a Volta GPU, how should I modify it?
Besides, it seems that only head_dim=64 is supported, right? How can I adapt it to the head_dim=32 case?
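To make the head_dim question concrete, this is roughly the call I would like to make (illustrative only; `fused_attention` is a placeholder for the wrapper I'm building from the template, not an existing API):

```python
import torch

# head_dim=32 instead of the 64 the template appears to assume
batch, heads, seq_len, head_dim = 2, 8, 1024, 32
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
# out = fused_attention(q, k, v)  # currently this path seems to require head_dim=64
```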