Release 4.4.0 and flash attention with python [WIP]
BBC-Esq opened this issue · 3 comments
It looks like Flash Attention was removed from the Python portion in release 4.4.0... I have a few questions:
-
Can you confirm that flash attention is still available in release 4.3.1? As far as I'm aware, no benchmarking was done on long-context QA with/without Flash Attention 2... only "relatively" short prompts/contexts. I'd still like to benchmark FA on longer contexts to see if there's a meaningful benefit.
-
Is it possible to compile version 4.4.0 so that Flash Attention is included in a wheel file, even though such a wheel won't be uploaded to pypi.org? I'd like to use version 4.4.0's improvements AND flash attention if my benchmarking indicates it's advantageous to do so. I'm not very familiar with compiling in general, so forgive the question, but if I compile from source, will the build include the relevant "python" portions that you say are now omitted?
Thanks again for the great work.
-
I conducted some benchmarks with a long context (around 3,000 tokens) and did not observe significant improvements. If you can run this benchmark on your side, I'd appreciate it. Release 4.3.1 always supports flash attention (there are some improvements in the 4.4.0 release, but not many, so testing with 4.3.1 is enough).
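For anyone who wants to run that comparison, here is a minimal timing sketch. It assumes the package in question is the ctranslate2 Python API with its `flash_attention` loader option and a SentencePiece tokenizer; the model path, tokenizer path, and prompt text are placeholders, so adjust them to whatever you actually test with.

```python
import time

import ctranslate2      # assumed to be the package discussed in this issue
import sentencepiece as spm

MODEL_DIR = "path/to/ct2_model"        # placeholder: converted model directory
TOKENIZER = "path/to/tokenizer.model"  # placeholder: SentencePiece model


def bench(flash: bool, prompt_tokens, new_tokens: int = 256, runs: int = 3) -> float:
    # flash_attention is the loader option this thread is about; it is assumed
    # to be accepted by the Generator constructor in the release being tested.
    gen = ctranslate2.Generator(MODEL_DIR, device="cuda", flash_attention=flash)

    # Warm-up run so one-time initialization is not included in the timings.
    gen.generate_batch([prompt_tokens], max_length=new_tokens,
                       include_prompt_in_result=False)

    times = []
    for _ in range(runs):
        start = time.perf_counter()
        gen.generate_batch([prompt_tokens], max_length=new_tokens,
                           include_prompt_in_result=False)
        times.append(time.perf_counter() - start)
    del gen
    return min(times)


sp = spm.SentencePieceProcessor(model_file=TOKENIZER)
# Build a long prompt (roughly 8k tokens here); replace with a real long document.
prompt_tokens = sp.encode("some long context text " * 2000, out_type=str)[:8192]

for flash in (False, True):
    print(f"flash_attention={flash}: best of {3} = {bench(flash, prompt_tokens):.2f}s")
```

If the loader option is named differently in the release you test, the Generator constructor's documentation is the place to check before trusting the numbers.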
-
If we don't push the wheel file to pypi.org, we would need to establish a new release process similar to the one used for Flash Attention releases. To keep things simple for now: more work is required on the Flash Attention feature, so we can reactivate it at a later stage.
Will benchmark in the near future when I have the time, hence the WIP in the title, and let y'all know if my results are different. I previously benchmarked and noticed significant benefits when only the beam_size
parameter was changed, but never got around to benchmarking much longer contexts - e.g. 8k/16k - which are starting to become the norm (like 4k was to 2k, etc.).
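As a side note on the beam_size comparison, that sweep is just a loop over the decoding option. A self-contained sketch, again assuming a ctranslate2-style generate_batch call, with placeholder paths and illustrative beam widths:

```python
import time

import ctranslate2            # same assumption as above
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="path/to/tokenizer.model")  # placeholder
prompt_tokens = sp.encode("some long context text " * 2000, out_type=str)[:8192]

gen = ctranslate2.Generator("path/to/ct2_model", device="cuda")  # placeholder path

for beam in (1, 2, 5):        # illustrative beam widths
    start = time.perf_counter()
    gen.generate_batch(
        [prompt_tokens],
        beam_size=beam,       # the only decoding option being varied
        max_length=256,
        include_prompt_in_result=False,
    )
    print(f"beam_size={beam}: {time.perf_counter() - start:.2f}s")
```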
UPDATE: I don't have time to benchmark right now, but will try my best in the future. Closing for now.