microsoft/BitBLAS

Support for loading sharded models

Qubitium opened this issue · 2 comments

Does BitBLAS currently support loading sharded models? For 4-bit quants of 70B+ models, Hugging Face enforces a max 50 GB upload limit for a single file, so without sharding it is hard to share quants of very large models. With Llama 3.1 405B dropping in the next few hours, we are preparing to upload a BitBLAS-compatible 4-bit GPTQ quant but are running into this sharding issue now. Thanks!
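For context, this is a minimal sketch of how a sharded safetensors checkpoint is typically resolved on the loading side: the standard Hugging Face layout ships a `model.safetensors.index.json` whose `weight_map` maps each tensor name to the shard file containing it. This is generic Hugging Face convention, not BitBLAS-specific code, and the helper name is hypothetical. (On the saving side, `transformers`' `save_pretrained(..., max_shard_size="48GB")` produces shards under the 50 GB limit automatically.)

```python
# Sketch: load all tensors from a sharded safetensors checkpoint.
# Assumes the standard HF layout with model.safetensors.index.json;
# the helper name load_sharded_state_dict is hypothetical.
import json
import os

from safetensors.torch import load_file


def load_sharded_state_dict(model_dir: str) -> dict:
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        # weight_map: tensor name -> shard filename
        weight_map = json.load(f)["weight_map"]

    state_dict = {}
    # Load each unique shard once and merge its tensors into one dict.
    for shard_file in sorted(set(weight_map.values())):
        state_dict.update(load_file(os.path.join(model_dir, shard_file)))
    return state_dict
```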

ModelCloud/GPTQModel#252

@LeiWang1999

Thanks for reporting, @Qubitium. Let me take a look.