huggingface/text-generation-inference

question: how could I disable flash-attention for llama?

CoinCheung opened this issue · 4 comments

Hi,

I need to deploy my model on older V100 GPUs, and it seems that flash attention does not currently support the V100, so I am wondering whether I can disable flash attention when deploying on a V100. How can I do this?
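For context, the V100 reports CUDA compute capability 7.0, and as far as I can tell the prebuilt flash attention kernels need 7.5 or newer, which is why they refuse to run there. A quick way to check the capability of your GPU (just a small sketch; the exact cutoff may differ between flash-attention versions):

import torch

# Print the compute capability of the first GPU; a V100 reports (7, 0).
# The flash attention CUDA kernels appear to target (7, 5) and newer,
# so the comparison below is only an approximate check.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
if (major, minor) < (7, 5):
    print("this GPU is likely not supported by the flash attention kernels")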

You can build your own image: modify the Dockerfile and remove this section:

# Build Flash Attention CUDA kernels
FROM kernel-builder as flash-att-builder
WORKDIR /usr/src
COPY server/Makefile-flash-att Makefile
# Build specific version of flash attention
RUN make build-flash-attention

Hi @dongs0104 ,

Thanks for telling me this, but it seems that we cannot bypass flash attention if we want to deploy llama. I got an error like this:

[screenshot of the error]
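As far as I understand, the llama path in the server only exists as a flash-attention implementation, so removing the builder stage just makes the import fail and there is nothing to fall back to. Roughly the pattern the model loader follows (a simplified sketch, not the actual TGI source; module and class names are illustrative):

# Simplified sketch of the import-guard pattern in the server's model loader
# (illustrative names, not the exact TGI source).
FLASH_ATTENTION = True
try:
    # flash_llama imports the flash attention CUDA kernels at import time,
    # so it fails on GPUs the kernels were not built for (e.g. V100)
    from text_generation_server.models.flash_llama import FlashLlama
except ImportError:
    FLASH_ATTENTION = False

def get_model(model_type: str):
    if model_type == "llama":
        if not FLASH_ATTENTION:
            # no non-flash llama implementation exists to fall back to,
            # which is what the error above is complaining about
            raise NotImplementedError("llama requires flash attention kernels")
        return FlashLlama
    raise ValueError(f"unknown model type {model_type}")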

Also, you cannot use a sharded model without flash attention; if you want sharding, you will have to develop it yourself.

Check the gpt_neox model here: https://github.com/OlivierDehaene/transformers/tree/text_generation_inference
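In case it helps anyone who does want to develop that themselves: the sharded implementations in that branch follow, as far as I can tell, the usual Megatron-style tensor parallelism, with the large linear layers split column- or row-wise across ranks. A minimal sketch of the column-parallel half (illustrative only, not the code from that branch):

import torch
from torch import nn

class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output dimension, so the weight
    matrix is sharded across GPUs (a sketch of the general pattern)."""

    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.linear = nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank computes its output slice; a matching row-parallel
        # layer later all-reduces the partial sums across ranks.
        return self.linear(x)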

Then I would rather wait for flash-attention to support my device. I am closing this. Thanks for the help!