How to use FlashAttention to speed up inference in BERT-like models
pradeepdev-1995 opened this issue · 1 comment
pradeepdev-1995 commented
Does FlashAttention support BERT-like models (BERT, DistilBERT, RoBERTa, etc.) for reducing inference latency?
If it does, how can FlashAttention be used to speed up inference in these models?
Please share sample code.
tridao commented
Please search for "bert" in the repo. If you don't put in effort, I don't think you'll get help.
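
For reference, a minimal sketch of calling the FlashAttention kernel directly via `flash_attn_func` from the `flash_attn` package. This is not the repo's full BERT integration (the repo ships its own BERT model code, which is what the comment above points to); it only illustrates the core call. It assumes a CUDA GPU, the `flash-attn` package installed, and fp16 tensors; BERT-style attention is bidirectional, so `causal=False`:

```python
# Minimal sketch: invoking the FlashAttention kernel on BERT-shaped inputs.
# Assumptions: CUDA GPU, flash-attn installed, fp16/bf16 inputs.
import torch
from flash_attn import flash_attn_func

# flash_attn_func expects q, k, v of shape (batch, seqlen, nheads, headdim).
batch, seqlen, nheads, headdim = 2, 512, 12, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# BERT-like models use bidirectional (non-causal) attention.
out = flash_attn_func(q, k, v, causal=False)  # (batch, seqlen, nheads, headdim)
print(out.shape)
```

To use this inside an actual BERT/DistilBERT/RoBERTa model, the attention layers would need to be swapped to call this kernel (or a FlashAttention-backed model implementation used); the snippet above only demonstrates the kernel interface itself.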