Why is the inference speed so slow?
khiemkhanh98 commented
It took 30 s to generate ~100 tokens on an A6000 GPU, which I found to be about 5× slower than LLaVA at the same size and quantization. Why is that the case?
ByungKwanLee commented
I am investigating it!
Are you sure your model is loaded in GPU VRAM?
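A quick way to confirm where the weights live is to inspect the device of the model's parameters. This is a minimal sketch using a stand-in `torch.nn.Linear` module; with MoAI you would pass the loaded model instead:

```python
import torch

def report_device(model: torch.nn.Module) -> str:
    """Return the device of the model's first parameter (e.g. 'cpu' or 'cuda:0')."""
    return str(next(model.parameters()).device)

# Stand-in module for illustration; substitute the actual MoAI model here.
toy = torch.nn.Linear(4, 4)
print(report_device(toy))  # "cpu" unless moved with .to("cuda")
```

If this prints `cpu` for your model, generation is running on the CPU, which alone would explain a large slowdown.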
ByungKwanLee commented
I think it may stem from Flash Attention.
The official LLaVA repository model and other Hugging Face models normally use Flash Attention.
However, I checked and MoAI is not applying it properly.
Therefore, I will try to equip it!