intel/intel-extension-for-transformers

Fails to load saved model: Trying to set a tensor of shape torch.Size([1376, 4096]) in "qweight" (which has shape torch.Size([4096, 1376])), this look incorrect.

kranipa opened this issue · 8 comments

Loading the saved model runs into the following error.
It also takes a very long time to run quantization and save the quantized model.

2024-03-21 08:48:58 [INFO] loading weights file models/4_bit_llama2-rtn/model.safetensors
2024-03-21 08:48:58 [ERROR] Trying to set a tensor of shape torch.Size([1376, 4096]) in "qweight" (which has shape torch.Size([4096, 1376])), this look incorrect.
2024-03-21 08:48:58 [ERROR] Saved low bit model loading failed, please check your model.

I tried the following example:

import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig, GPTQConfig, AwqConfig

model_path = "meta-llama/Llama-2-7b-chat-hf" # your_pytorch_model_path_or_HF_model_name
saved_dir = "models/4_bit_llama2-rtn" # your_saved_model_dir
#model_path  = "Intel/neural-chat-7b-v3-3" 
#saved_dir = "models/4_bit_neural_chat_7b-v3-3-rtn"
# quant
woq_config = RtnConfig(bits=4, compute_dtype="int8", scale_dtype='fp32', group_size=32)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                            device_map='cpu',
                                            torch_dtype=torch.float16,
                                            quantization_config=woq_config, 
                                            trust_remote_code=True,
                                            use_neural_speed=False)
# save quant model
model.save_pretrained(saved_dir)
# load quant model
loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir, trust_remote_code=True)
intel-extension-for-transformers==1.4rc2.dev8+g494a5712fa2
neural-compressor==2.4.1
neural-speed==0.4.dev21+g0ec1a6e

model = AutoModelForCausalLM.from_pretrained(model_path, device_map='cpu', torch_dtype=torch.float16, quantization_config=woq_config, trust_remote_code=True, use_neural_speed=False)
Do you want to use neural_speed? If yes, try setting use_neural_speed=True.
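
A minimal sketch of that suggestion, reusing model_path and woq_config from the snippet above with only the flag flipped (whether the returned object then supports saving is exactly what is in question here):

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map='cpu',
                                             torch_dtype=torch.float16,
                                             quantization_config=woq_config,
                                             trust_remote_code=True,
                                             use_neural_speed=True)  # flipped from False to True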

Thank you for the response.

With use_neural_speed=True, the save function doesn't work.

I get the following error:

AttributeError: 'Model' object has no attribute 'save_pretrained'

Can you share an example of how to save a quantized model (a Model object) with neural_speed?

It looks like a load/save mismatch. Can you try the latest commit instead of g494a5712fa2 and set use_neural_speed=False?

Hi, thank you. Saving works; however, loading the saved model leads to the following error:


    raise ValueError(
ValueError: Unknown quantization type, got rtn - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm']

The following is the code snippet:

import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig, GPTQConfig, AwqConfig


model_path = "meta-llama/Llama-2-7b-chat-hf" # your_pytorch_model_path_or_HF_model_name
saved_dir = "models/4_bit_llama2-rtn" # your_saved_model_dir
#model_path  = "Intel/neural-chat-7b-v3-3" 
#saved_dir = "models/4_bit_neural_chat_7b-v3-3-rtn"
# quant
woq_config = RtnConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                            device_map='cpu',
                                            #torch_dtype=torch.float16,
                                            quantization_config=woq_config, 
                                            trust_remote_code=True,
                                            use_neural_speed=False)
# save quant model
model.save_pretrained(saved_dir)
#load quant model
loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir, trust_remote_code=True)
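
For reference, once loading succeeds, a quick sanity check could look like the sketch below; the tokenizer source and prompt are illustrative assumptions, not part of the original report.

from transformers import AutoTokenizer

# Assumption: the tokenizer from the original model_path matches the saved quantized model.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = loaded_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))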

@kranipa, this issue is caused by a version mismatch between ITREX and neural-compressor. You can use neural-compressor version 2.5.1 and try again. ITREX 1.4 is released now, please try it. Thanks very much.
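
If it helps anyone hitting the same mismatch, the installed versions can be confirmed before retrying with the standard importlib.metadata module (a minimal sketch; package names as published on PyPI):

from importlib.metadata import version

# Print the installed versions of the two packages whose mismatch causes the error.
for pkg in ("intel-extension-for-transformers", "neural-compressor"):
    print(pkg, version(pkg))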

Okay, thank you.

@kranipa Did you get it to run? I'm having the same problem.

@PhzCode, could you post your code so I can try to reproduce it? Thanks very much.