question: how could I disable flash-attention for llama?
CoinCheung opened this issue · 4 comments
Hi,
I need to deploy my model on older V100 GPUs, and it seems that flash attention does not currently support the V100, so I am thinking I could disable flash attention when deploying on V100. How can I do this?
You can build your own image: modify the Dockerfile to remove the flash-attention build step referenced here, then rebuild; see the sketch below the snippet.
text-generation-inference/Dockerfile, lines 90 to 98 at commit 95d3546
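In case it is useful, a rough sketch of that workflow, assuming the standard repository layout; the image tag here is just an example:

```bash
# Clone the repo, delete the flash-attention build step from the Dockerfile
# (the lines referenced above), then build a custom image.
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
# ... edit the Dockerfile here ...
docker build -t text-generation-inference:no-flash-attn .
```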
Hi @dongs0104,
Thanks for telling me this, but it seems we cannot bypass flash attention if we want to deploy llama; I got an error like this:
Also, you cannot use a sharded model without flash attention; if you want sharding, you would have to implement it yourself.
Check the gpt_neox model here for reference: https://github.com/OlivierDehaene/transformers/tree/text_generation_inference (see the sketch below).
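For context, a plain (non-flash) attention forward of the kind such a V100-compatible path would fall back to looks roughly like this; the function name and shapes are illustrative, not TGI's or the fork's actual code:

```python
import math
import torch


def standard_attention(q, k, v, attention_mask=None):
    # q, k, v: [batch, num_heads, seq_len, head_dim]
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if attention_mask is not None:
        # additive mask, e.g. large negative values on masked positions
        scores = scores + attention_mask
    probs = torch.softmax(scores, dim=-1)
    # The full [seq_len, seq_len] score matrix is materialized here,
    # which is exactly what flash attention avoids.
    return torch.matmul(probs, v)
```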
Then I would rather wait for flash-attention to support my device. I am closing this. Thanks for the help!