vikhyat/mixtral-inference

GPUs

fakerybakery opened this issue · 1 comment

Hi,
Great repo! You mentioned you need quite a few A100s. If this model is ~50B parameters and people can run Llama 2 70B on a single A100, why does this take so much compute?
Thank you!

I've never tried Llama 70B, but this is running in fp16 without any quantization. That might be part of it?
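A rough back-of-the-envelope sketch of why precision matters so much here (assumptions, not from the thread: Mixtral 8x7B has roughly 46.7B total parameters, fp16 weights take 2 bytes each, a 4-bit quantized model takes about 0.5 bytes per parameter, and an A100 has 80 GiB; activation and KV-cache overhead is ignored):

```python
# Approximate weight-memory math: fp16 vs. 4-bit quantization.
# Assumed figures (not stated in the thread): ~46.7B params for Mixtral 8x7B,
# 2 bytes/param for fp16, 0.5 bytes/param for 4-bit, 80 GiB per A100.

GIB = 1024 ** 3

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """GiB needed just to hold the model weights (no activations/KV cache)."""
    return num_params * bytes_per_param / GIB

mixtral_fp16 = weight_memory_gib(46.7e9, 2.0)   # ~87 GiB -> exceeds one 80 GiB A100
llama70b_4bit = weight_memory_gib(70e9, 0.5)    # ~33 GiB -> fits on one A100

print(f"Mixtral 8x7B in fp16:  {mixtral_fp16:.0f} GiB")
print(f"Llama 2 70B at 4-bit:  {llama70b_4bit:.0f} GiB")
```

If those assumptions hold, the unquantized fp16 weights alone don't fit on a single 80 GiB A100, whereas the 70B comparisons people cite typically rely on 4-bit or 8-bit quantization.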