Dao-AILab/flash-attention

Why can't flash-attn accelerate on an A40 machine?

zhangxihou opened this issue · 1 comment

Hi, I tested the flash-attn operation on an A40 machine and it showed no improvement in training speed. Moreover, I printed out the time cost of the self-attention calculation in two models: one used standard attention, the other used flash-attn. Apart from that, the two models are identical. So can flash-attn only accelerate training on an A100, and not on an A40? By the way, flash-attn did reduce the CUDA memory usage!
Here are the environment and other configurations:
work: speech recognition
env: CUDA 12.1, torch 2.1.2, flash-attn 2.5.2
models: 11 × conformer
operation used: flash_attn_qkvpacked_func

Please benchmark just the attention operation
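
A minimal sketch of how such a benchmark might look, timing only the attention op rather than the whole model. The tensor sizes (batch, seqlen, nheads, headdim) are hypothetical placeholders; adjust them to match your conformer configuration. It assumes fp16 inputs on a CUDA device and flash-attn 2.x installed:

```python
# Benchmark sketch: standard attention vs flash_attn_qkvpacked_func.
# Sizes below are illustrative, not taken from the reporter's model.
import time
import torch
from flash_attn import flash_attn_qkvpacked_func

def reference_attention(qkv):
    # qkv: (batch, seqlen, 3, nheads, headdim)
    q, k, v = qkv.unbind(dim=2)                       # each (B, S, H, D)
    q, k, v = [x.transpose(1, 2) for x in (q, k, v)]  # (B, H, S, D)
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v).transpose(1, 2)      # back to (B, S, H, D)

def bench(fn, *args, iters=100, warmup=10):
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1e3  # ms per call

batch, seqlen, nheads, headdim = 8, 1024, 8, 64  # hypothetical sizes
qkv = torch.randn(batch, seqlen, 3, nheads, headdim,
                  device="cuda", dtype=torch.float16)

print(f"standard attention: {bench(reference_attention, qkv):.3f} ms")
print(f"flash-attn:         {bench(flash_attn_qkvpacked_func, qkv):.3f} ms")
```

Isolating the op this way separates the attention kernel itself from the rest of the conformer (convolutions, feed-forward blocks), which may dominate the step time and hide any attention speedup, especially at short sequence lengths.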