dvmazur/mixtral-offloading

How to use offloading in my MoE model?



We fine-tuned Mixtral-8x7B with LoRA, and it performs well for Chinese chat. However, even with int4 quantization it still requires 25 GB of VRAM to run. How can we use offloading with our model? Thank you!

Hey, @WangRongsheng!

The code in this repository currently supports only HQQ-quantized models, so using it with your model won't be straightforward.

If you don't know where to start, please read our tech-report.
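For reference, here is a rough sketch of what HQQ quantization of a single linear layer looks like when using the hqq package directly. The layer shape, the nbits/group_size values, and the HQQLinear keyword arguments are assumptions based on a recent hqq release, not the exact configuration pinned by this repository.

```python
# Illustrative only: quantize one linear layer with HQQ.
# The config values and HQQLinear arguments are assumptions based on a recent
# hqq release, not this repository's pinned setup.
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 14336, bias=False)  # a Mixtral-sized FFN projection

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_linear = HQQLinear(
    linear,                       # the float layer to quantize
    quant_config,                 # 4-bit, group-wise HQQ settings
    compute_dtype=torch.float16,  # dtype used for the dequantized matmul
    device="cuda",
)

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = hqq_linear(x)  # forward pass through the quantized layer
print(y.shape)
```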

Hello, @WangRongsheng!

As @dvmazur said, this repository currently supports only HQQ quantization. However, you can load our quantized model and add LoRA layers to it (I'm not sure whether peft.PeftModel handles this correctly, so you may need to load them by hand). Note that the LoRA weights must always reside on the GPU, because our offloading mechanism does not yet support offloading custom weights.
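For illustration, below is a minimal sketch of attaching LoRA weights by hand, in case peft.PeftModel does not handle the quantized layers. The ManualLoRA class, the r/alpha values, and the float16 dtype are placeholder assumptions rather than code from this repository; the wrapped base layer stays under the repository's own quantization and offloading.

```python
# Minimal sketch of hand-rolled LoRA around an already-quantized layer.
# ManualLoRA, r/alpha, and float16 are illustrative assumptions; the base
# layer itself remains managed by the repository's quantization/offloading.
import torch
import torch.nn as nn


class ManualLoRA(nn.Module):
    def __init__(self, base_layer, in_features, out_features, r=16, alpha=32):
        super().__init__()
        self.base_layer = base_layer  # quantized layer, handled by the offloader
        # LoRA weights are created directly on the GPU and stay there,
        # since the offloading mechanism does not move custom weights.
        self.lora_A = nn.Parameter(
            torch.zeros(r, in_features, dtype=torch.float16, device="cuda")
        )
        self.lora_B = nn.Parameter(
            torch.zeros(out_features, r, dtype=torch.float16, device="cuda")
        )
        nn.init.kaiming_uniform_(self.lora_A, a=5**0.5)  # B stays zero, as in peft
        self.scaling = alpha / r

    def forward(self, x):
        # Quantized base projection plus the low-rank LoRA update.
        return self.base_layer(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

After loading the quantized model, you would replace each fine-tuned target module with a ManualLoRA wrapper around it and copy your adapter's A and B matrices into lora_A and lora_B.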