Support for loading sharded models
Qubitium opened this issue · 2 comments
Qubitium commented
Does BitBLAS currently support loading sharded models? For 4-bit quants of 70B+ models, Hugging Face enforces a 50GB upload limit per file, so without sharding it is hard to share quants of very large models. With Llama 3.1 405B dropping in the next few hours, we are preparing to upload a BitBLAS-compatible 4-bit GPTQ quant but are running into this sharding issue now. Thanks!
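For context, this is roughly what loading a sharded checkpoint involves on the Hugging Face side: an index JSON maps each tensor name to the shard file that contains it. Below is a minimal sketch assuming the standard safetensors sharded layout; `load_sharded_state_dict` is a hypothetical helper for illustration, not a BitBLAS or GPTQModel API:

```python
import json
import os

from safetensors import safe_open


def load_sharded_state_dict(model_dir: str) -> dict:
    """Load a sharded safetensors checkpoint by following its index file."""
    # The index file maps tensor names to shard filenames, e.g.
    # {"model.layers.0.self_attn.q_proj.qweight": "model-00001-of-00009.safetensors", ...}
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]

    state_dict = {}
    # Visit each shard file once and collect every tensor it holds.
    for shard_file in sorted(set(weight_map.values())):
        with safe_open(os.path.join(model_dir, shard_file), framework="pt") as shard:
            for name in shard.keys():
                state_dict[name] = shard.get_tensor(name)
    return state_dict
```

On the producing side, transformers' `save_pretrained(out_dir, max_shard_size="48GB")` writes shards under the 50GB limit together with the index file that the sketch above consumes.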
LeiWang1999 commented
Thanks for reporting, @Qubitium; let me take a look.
LeiWang1999 commented
Closed; see ModelCloud/GPTQModel#252 (comment).