triton-lang/triton

Why is fused attention only applicable to Ampere GPUs?

rayleizhu opened this issue · 1 comment

Hi, I'm writing my own operator using fused attention as a template. However, I found that fused attention requires the Ampere architecture:

https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L200
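For reference, my reading of the linked line is that it is a compute-capability guard (Ampere is sm_80 and newer). A quick way to see what such a guard would find on a given machine, assuming a working PyTorch CUDA setup:

```python
import torch

# Report the compute capability that an "Ampere or newer" guard would see.
# Ampere is sm_80/sm_86; Volta (e.g. V100) is sm_70.
major, minor = torch.cuda.get_device_capability()
status = "meets an >= sm_80 requirement" if major >= 8 else "pre-Ampere"
print(f"compute capability: sm_{major}{minor} ({status})")
```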

I do not fully understand this requirement:

  • Does it mean this template uses some arch-specific operators?
  • To use it on a Volta GPU, how should I modify it? (My current fallback is sketched after this list.)
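For now, my workaround on the Volta machine is to dispatch on compute capability and fall back to a plain PyTorch attention when the fused op is unavailable. A minimal sketch, assuming the fused op is callable as `attention(q, k, v, sm_scale)` (the exact signature may differ between Triton versions):

```python
import math
import torch

def reference_attention(q, k, v, sm_scale):
    # Plain PyTorch attention, used as a fallback on pre-Ampere GPUs.
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) * sm_scale
    probs = torch.softmax(scores.float(), dim=-1).to(v.dtype)
    return torch.matmul(probs, v)

def attention_any_gpu(q, k, v):
    sm_scale = 1.0 / math.sqrt(q.shape[-1])
    major, _ = torch.cuda.get_device_capability(q.device)
    if major >= 8:
        # Fused Triton path (import path taken from the linked file;
        # the call signature is my guess, please correct me if it is wrong).
        from triton.ops.flash_attention import attention
        return attention(q, k, v, sm_scale)
    return reference_attention(q, k, v, sm_scale)
```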

Also, it seems that only head_dim=64 is supported, right? How can I adapt it for the head_dim=32 case?

https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L207
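In case it helps to clarify what I'm after: the workaround I'm considering for head_dim=32 is zero-padding the head dimension up to 64 and slicing the output back, since the extra zero columns contribute nothing to the q @ kᵀ scores and the padded output channels stay zero. A sketch, again assuming an `attention(q, k, v, sm_scale)` signature:

```python
import math
import torch
import torch.nn.functional as F
from triton.ops.flash_attention import attention  # path from the linked file; signature may differ

def attention_headdim32(q, k, v):
    # q, k, v: (batch, heads, seq_len, 32). Zero-pad head_dim to the supported 64,
    # run the fused kernel, then slice back to 32. Zero columns leave q @ k^T
    # unchanged and the padded output channels stay zero.
    d = q.shape[-1]
    sm_scale = 1.0 / math.sqrt(d)          # scale from the original head_dim, not 64
    qp, kp, vp = (F.pad(t, (0, 64 - d)) for t in (q, k, v))
    out = attention(qp, kp, vp, sm_scale)  # call signature is my guess; adjust per version
    return out[..., :d]
```

That said, I'd prefer a native head_dim=32 path if the kernel can support it.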

There is some more information in #616.