ByungKwanLee/MoAI

Why is the inference speed so slow?


It took 30s to generate ~100 tokens on an A6000 GPU, which I found to be around 5x slower than LLaVA of the same size and quantization. Why is this the case?
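
For context, a tokens-per-second figure like this can be reproduced with a minimal timing harness around the standard transformers `generate` API. This is a sketch only; the model id and prompt are placeholders, not the exact setup from this report:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some/model"  # placeholder; substitute the checkpoint under test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Describe the image.", return_tensors="pt").to("cuda")

# Synchronize around generate so the timing covers the full GPU work.
torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tok/s")
```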

I am trying to investigate it!

Are you sure your model is in GPU VRAM?
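
A quick way to check this, assuming `model` is the object loaded as in the snippet above: inspect where the parameters actually live, since any weights left on the CPU will make generation crawl.

```python
# Expect device(type='cuda', index=0); "cpu" here explains the slowdown.
print(next(model.parameters()).device)

# If the model was loaded with device_map= (accelerate offloading),
# check the per-module placement instead; None means no device map was used.
print(getattr(model, "hf_device_map", None))
```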

I think it may stem from flash attention not being applied.

The official LLaVA repository model and other Hugging Face models normally have flash attention applied.

However, I checked and MoAI does not apply it properly.

Therefore, I will try to equip MoAI with it!
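
For reference, this is how flash attention is typically enabled for Hugging Face models; it requires the flash-attn package, half-precision weights, and a supported GPU. Whether MoAI's custom modules actually pick this flag up is exactly what is being checked here, and the model id below is a placeholder, not MoAI's real loading path:

```python
import torch
from transformers import AutoModelForCausalLM

# attn_implementation="flash_attention_2" asks transformers to route
# attention through the flash-attn kernels instead of the eager path.
model = AutoModelForCausalLM.from_pretrained(
    "some/model",  # placeholder
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```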