intel/neural-speed

Loading checkpoint shards takes too long

irjawais opened this issue · 2 comments

When I load the "meta-llama/Meta-Llama-3-8B-Instruct" model like this:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # Hugging Face model_id or local model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

it hangs. The only way to recover is to restart the instance.

Is there an issue with my instance spec?

My instance spec: Ubuntu, 32 GB RAM.

warnings.warn(
Loading checkpoint shards: 75%|█████████████████████████████████████████████████████████████████████████████████████████████████████████ | 3/4 [01:53<00:37, 37.72s/it]
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 593, in from_pretrained
model.init( # pylint: disable=E1123
File "/usr/local/lib/python3.10/dist-packages/neural_speed/init.py", line 182, in init
assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
AssertionError: Fail to convert pytorch model

@irjawais
Can you check memory usage while the model is being converted? From your description it sounds like the conversion ran out of memory: the assertion fires when the intermediate fp32 binary (fp32_bin) was never written, and dumping an 8B-parameter model in fp32 needs roughly 8B × 4 bytes ≈ 32 GB on its own, which by itself saturates a 32 GB instance.
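To confirm, you can watch memory from a second shell while from_pretrained runs. Below is a minimal sketch using psutil (my choice of tool, not something neural-speed requires; free -h or htop work just as well):

import time

import psutil

def watch_memory(interval_s: float = 1.0) -> None:
    # Print RAM and swap usage once per interval until Ctrl-C.
    try:
        while True:
            mem = psutil.virtual_memory()
            swap = psutil.swap_memory()
            print(
                f"RAM: {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB ({mem.percent}%), "
                f"swap: {swap.used / 2**30:.1f} GiB"
            )
            time.sleep(interval_s)
    except KeyboardInterrupt:
        pass

if __name__ == "__main__":
    watch_memory()

If used memory climbs to the full 32 GB (and swap starts filling) right before the assertion fails, that would confirm the conversion is failing under memory pressure.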