ByungKwanLee/MoAI

Why is the inference speed so slow?


It took 30s to generate ~100 tokens on an A6000 GPU, which I found to be around 5x slower than LLaVA of the same size and quantization. Why is this the case?
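
For context, a tokens-per-second figure like this can be reproduced with a minimal timing harness around the standard transformers `generate` API. This is a sketch only; the model id and prompt are placeholders, not the exact setup from this report:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some/model"  # placeholder; substitute the checkpoint under test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Describe the image.", return_tensors="pt").to("cuda")

# Synchronize around generate so the timing covers the full GPU work.
torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tok/s")
```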

I am trying to investigate it!

Are you sure your model is in GPU VRAM?
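
A quick way to check this, assuming `model` is the object loaded as in the snippet above: inspect where the parameters actually live, since any weights left on the CPU will make generation crawl.

```python
# Expect device(type='cuda', index=0); "cpu" here explains the slowdown.
print(next(model.parameters()).device)

# If the model was loaded with device_map= (accelerate offloading),
# check the per-module placement instead; None means no device map was used.
print(getattr(model, "hf_device_map", None))
```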

I think it may stem from flash attention not being applied.

The official LLaVA repository model and other Hugging Face models normally have flash attention applied.

However, I checked and MoAI does not apply it properly.

Therefore, I will try to equip MoAI with it!
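
For reference, this is how flash attention is typically enabled for Hugging Face models; it requires the flash-attn package, half-precision weights, and a supported GPU. Whether MoAI's custom modules actually pick this flag up is exactly what is being checked here, and the model id below is a placeholder, not MoAI's real loading path:

```python
import torch
from transformers import AutoModelForCausalLM

# attn_implementation="flash_attention_2" asks transformers to route
# attention through the flash-attn kernels instead of the eager path.
model = AutoModelForCausalLM.from_pretrained(
    "some/model",  # placeholder
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```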