dvmazur/mixtral-offloading

How to use offloading in my MoE model?



We fine-tuned Mixtral-8x7B with LoRA, and it performs well for Chinese chat. However, even with int4 quantization it still requires 25 GB of VRAM to run. How can we use offloading with our model? Thank you!

Hey, @WangRongsheng!

The code in this repository currently supports only HQQ-quantized models, so using it with your model won't be straightforward.

If you don't know where to start, please read our tech-report.
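For reference, here is a rough sketch of what HQQ quantization of a single linear layer looks like when using the hqq package directly. The layer shape, the nbits/group_size values, and the HQQLinear keyword arguments are assumptions based on a recent hqq release, not the exact configuration pinned by this repository.

```python
# Illustrative only: quantize one linear layer with HQQ.
# The config values and HQQLinear arguments are assumptions based on a recent
# hqq release, not this repository's pinned setup.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 14336, bias=False)  # a Mixtral-sized FFN projection

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_linear = HQQLinear(
    linear,                       # the float layer to quantize
    quant_config,                 # 4-bit, group-wise HQQ settings
    compute_dtype=torch.float16,  # dtype used for the dequantized matmul
    device="cuda",
)

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = hqq_linear(x)  # forward pass through the quantized layer
print(y.shape)
```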

Hello, @WangRongsheng!

As @dvmazur said, this repository currently supports only HQQ quantization. However, you can load our quantized model and add LoRA layers to it (I'm not sure whether peft.PeftModel handles this correctly, so you may need to load them by hand). Note that the LoRA weights must always reside on the GPU, because our offloading mechanism does not yet support offloading custom weights.
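For illustration, below is a minimal sketch of attaching LoRA weights by hand, in case peft.PeftModel does not handle the quantized layers. The ManualLoRA class, the r/alpha values, and the float16 dtype are placeholder assumptions rather than code from this repository; the wrapped base layer stays under the repository's own quantization and offloading.

```python
# Minimal sketch of hand-rolled LoRA around an already-quantized layer.
# ManualLoRA, r/alpha, and float16 are illustrative assumptions; the base
# layer itself remains managed by the repository's quantization/offloading.
import torch
import torch.nn as nn


class ManualLoRA(nn.Module):
    def __init__(self, base_layer, in_features, out_features, r=16, alpha=32):
        super().__init__()
        self.base_layer = base_layer  # quantized layer, handled by the offloader
        # LoRA weights are created directly on the GPU and stay there,
        # since the offloading mechanism does not move custom weights.
        self.lora_A = nn.Parameter(
            torch.zeros(r, in_features, dtype=torch.float16, device="cuda")
        )
        self.lora_B = nn.Parameter(
            torch.zeros(out_features, r, dtype=torch.float16, device="cuda")
        )
        nn.init.kaiming_uniform_(self.lora_A, a=5**0.5)  # B stays zero, as in peft
        self.scaling = alpha / r

    def forward(self, x):
        # Quantized base projection plus the low-rank LoRA update.
        return self.base_layer(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

After loading the quantized model, you would replace each fine-tuned target module with a ManualLoRA wrapper around it and copy your adapter's A and B matrices into lora_A and lora_B.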