huggingface/text-generation-inference

question: how could I disable flash-attention for llama?

CoinCheung opened this issue · 4 comments

Hi,

I need to deploy my model on older V100 GPUs, and it seems that flash attention does not currently support the V100, so I am wondering whether I can disable flash attention when deploying on a V100. How can I do this?
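For context, the V100 reports CUDA compute capability 7.0, and as far as I can tell the prebuilt flash attention kernels need 7.5 or newer, which is why they refuse to run there. A quick way to check the capability of your GPU (just a small sketch; the exact cutoff may differ between flash-attention versions):

import torch

# Print the compute capability of the first GPU; a V100 reports (7, 0).
# The flash attention CUDA kernels appear to target (7, 5) and newer,
# so the comparison below is only an approximate check.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
if (major, minor) < (7, 5):
    print("this GPU is likely not supported by the flash attention kernels")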

You can build your own image: modify the Dockerfile and remove this section:

# Build Flash Attention CUDA kernels
FROM kernel-builder as flash-att-builder
WORKDIR /usr/src
COPY server/Makefile-flash-att Makefile
# Build specific version of flash attention
RUN make build-flash-attention

Hi @dongs0104 ,

Thanks for telling me this, but it seems that we cannot bypass flash attention if we want to deploy llama. I got an error like this:

[screenshot of the error]
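As far as I understand, the llama path in the server only exists as a flash-attention implementation, so removing the builder stage just makes the import fail and there is nothing to fall back to. Roughly the pattern the model loader follows (a simplified sketch, not the actual TGI source; module and class names are illustrative):

# Simplified sketch of the import-guard pattern in the server's model loader
# (illustrative names, not the exact TGI source).
FLASH_ATTENTION = True
try:
    # flash_llama imports the flash attention CUDA kernels at import time,
    # so it fails on GPUs the kernels were not built for (e.g. V100)
    from text_generation_server.models.flash_llama import FlashLlama
except ImportError:
    FLASH_ATTENTION = False

def get_model(model_type: str):
    if model_type == "llama":
        if not FLASH_ATTENTION:
            # no non-flash llama implementation exists to fall back to,
            # which is what the error above is complaining about
            raise NotImplementedError("llama requires flash attention kernels")
        return FlashLlama
    raise ValueError(f"unknown model type {model_type}")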

Also, you cannot use a sharded model without flash attention; if you want sharding, you will have to develop it yourself.

Check the gpt_neox model here: https://github.com/OlivierDehaene/transformers/tree/text_generation_inference
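In case it helps anyone who does want to develop that themselves: the sharded implementations in that branch follow, as far as I can tell, the usual Megatron-style tensor parallelism, with the large linear layers split column- or row-wise across ranks. A minimal sketch of the column-parallel half (illustrative only, not the code from that branch):

import torch
from torch import nn

class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output dimension, so the weight
    matrix is sharded across GPUs (a sketch of the general pattern)."""

    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        self.linear = nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank computes its output slice; a matching row-parallel
        # layer later all-reduces the partial sums across ranks.
        return self.linear(x)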

Then I would rather wait for flash-attention to support my device. I am closing this. Thanks for the help!