Cannot finish FP4 quantization: `RuntimeError: Qbits: only support Integer WOQ in PACKQ`
PhzCode opened this issue · 6 comments
Hey! I'm trying to quantize llama3-8b to FP4 by running the code in https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_cpu_woq.py
Here's my run command:
python run_generation_cpu_woq.py --model llama_3_8b --woq --woq_algo GPTQ --bits 4 --weight_dtype fp4_e2m1 --scale_dtype fp32 --compute_dtype fp32 --benchmark --accuracy --load_in_4bit
In run_generation_cpu_woq.py, the FP4 choices for the weight_dtype argument are only fp4_e2m1_bnb and fp4_e2m1. However, /intel_extension_for_transformers/transformers/llm/quantization/utils.py:269 only checks for the literal string fp4, so execution falls through to line 281, enters the INT-type logic, and fails with RuntimeError: Qbits: only support Integer WOQ in PACKQ.
How do I run the program to do FP4 quantization?
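To make the fall-through concrete, here is a simplified sketch of the dispatch I am describing (illustrative only, not the exact code from utils.py; the weight_dtype strings are the ones accepted by run_generation_cpu_woq.py):

# Simplified sketch of the dtype dispatch (illustrative, not the real utils.py code).
weight_dtype = "fp4_e2m1"  # value passed via --weight_dtype

if weight_dtype == "fp4":      # never true for "fp4_e2m1" / "fp4_e2m1_bnb"
    branch = "fp4"
elif weight_dtype == "nf4":    # hypothetical sibling branch, for illustration only
    branch = "nf4"
else:                          # FP4 requests fall into the integer path ...
    branch = "int"             # ... which ends in "only support Integer WOQ in PACKQ"

print(branch)  # prints "int" instead of "fp4"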
Related software packages and versions are as follows:
intel-extension-for-pytorch 2.3.0
intel-extension-for-transformers 1.4.1
neural_compressor 2.6.dev30+g7c0b700
neural-speed 1.0
Here's the error:
2024-05-28 22:27:33 [INFO] Save deploy yaml to /home/nc_workspace/2024-05-28_14-12-59/deploy.yaml
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Traceback (most recent call last):
  File "/home/psi/run_generation_cpu_woq.py", line 299, in <module>
    user_model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 672, in from_pretrained
    model = convert_to_quantized_model(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 563, in convert_to_quantized_model
    q_model = replace_linear(inc_model, None, None, config, device=device)
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 119, in replace_linear
    model, is_replaced = _replace_linear(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 316, in _replace_linear
    _, is_replaced = _replace_linear(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 316, in _replace_linear
    _, is_replaced = _replace_linear(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 316, in _replace_linear
    _, is_replaced = _replace_linear(
  [Previous line repeated 1 more time]
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 289, in _replace_linear
    model._modules[name].set_weights_bias(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/nn/modules.py", line 226, in set_weights_bias
    packw = qbits.repack_quantized_weight(
RuntimeError: Qbits: only support Integer WOQ in PACKQ
Exception raised from bestla_packq at /tmp/pip-install-ue0q49gf/intel-extension-for-transformers_la94d9ed4b1947cca3046b78bd887212/intel_extension_for_transformers/qbits/dispatcher/src/bestla_packq_impl.cpp:185 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7faa71244167 in /root/miniconda3/envs/python310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7faa711f432e in /root/miniconda3/envs/python310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: woq::bestla_packq(woq::repack_quantized_weight_param*, woq::repack_quantized_weight_ctx*, woq::WOQ_TASK) + 0x22d (0x7fa9e875097d in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0xae20c (0x7fa9e871b20c in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0xaef34 (0x7fa9e871bf34 in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0xa7822 (0x7fa9e8714822 in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #6: python() [0x4fd907]
<omitting python frames>
frame #9: python() [0x5095ce]
frame #25: python() [0x5095ce]
frame #27: python() [0x5951c2]
In /intel_extension_for_transformers/transformers/llm/quantization/utils.py, I added fp4_e2m1_bnb and fp4_e2m1 to the places where the FP4 dtype is checked. With that change the quantization finishes and the model is saved, but an error then occurs at inference time.
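For reference, my local workaround is roughly the following (a sketch of the intent, not the exact diff I applied against intel-extension-for-transformers 1.4.1):

# Sketch of the workaround: treat the dtype names produced by the example
# script as FP4 as well, instead of matching only the literal "fp4".
FP4_DTYPES = ("fp4", "fp4_e2m1", "fp4_e2m1_bnb")

def is_fp4(weight_dtype: str) -> bool:
    # The original check was effectively `weight_dtype == "fp4"`, which misses
    # the two extra names above and pushed FP4 requests into the integer branch.
    return weight_dtype in FP4_DTYPES

print(is_fp4("fp4_e2m1"))      # True with the workaround
print(is_fp4("fp4_e2m1_bnb"))  # True with the workaround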
Output as follows:
2024-05-30 23:36:32 [INFO] Quantization done
2024-05-30 23:36:41 [INFO] GPTQ quantizing done.
2024-05-30 23:36:53 [INFO] |******Mixed Precision Statistics******|
2024-05-30 23:36:53 [INFO] +---------+-------+-----------+--------+
2024-05-30 23:36:53 [INFO] | Op Type | Total | A32W4G128 | FP32 |
2024-05-30 23:36:53 [INFO] +---------+-------+-----------+--------+
2024-05-30 23:36:53 [INFO] | Linear | 225 | 224 | 1 |
2024-05-30 23:36:53 [INFO] +---------+-------+-----------+--------+
2024-05-30 23:36:53 [INFO] Pass quantize model elapsed time: 28941231.53 ms
2024-05-30 23:36:53 [INFO] Save tuning history to /home/nc_workspace/2024-05-30_15-33-54/./history.snapshot.
2024-05-30 23:36:53 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-05-30 23:36:53 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-05-30 23:36:53 [INFO] Save deploy yaml to /home/nc_workspace/2024-05-30_15-33-54/deploy.yaml
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
2024-05-30 23:37:06 [INFO] WeightOnlyQuant done.
2024-05-30 23:47:33 [INFO] Configuration saved in ./saved_results/quantize_config.json
2024-05-30 23:47:33 [INFO] quantization_config: {'bits': 4, 'compute_dtype': 'fp32', 'damp_percent': 0.01, 'desc_act': False, 'group_size': 128, 'llm_int8_skip_modules': [], 'quant_method': 'gptq', 'scale_dtype': 'fp32', 'sym': True, 'weight_dtype': 'fp4_e2m1'}
2024-05-30 23:47:33 [INFO] loading weights file ./saved_results/model.safetensors.index.json
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Loading model from: ./saved_results
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
2024-05-30 23:47:33 [ERROR] Trying to set a tensor of shape torch.Size([3584, 4096]) in "qweight" (which has shape torch.Size([1792, 4096])), this look incorrect.
2024-05-30 23:47:33 [ERROR] Saved low bit model loading failed, please check your model.
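For what it's worth, the two shapes differ by exactly a factor of two, and both divide llama3-8b's FFN dimension (14336), which is why I suspect the saved qweight and the loader disagree about the packing factor rather than the file being corrupted (the arithmetic is below; the interpretation is only my guess):

# Quick arithmetic on the reported shapes (my own interpretation, not from the code).
saved_rows, expected_rows = 3584, 1792
intermediate_size = 14336  # llama3-8b FFN dimension

print(saved_rows * 4 == intermediate_size)     # True: 14336 / 4 == 3584
print(expected_rows * 8 == intermediate_size)  # True: 14336 / 8 == 1792
print(saved_rows / expected_rows)              # 2.0 -> packing density differs by 2x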
I suspect the quantization itself is also wrong. I quantized the same llama3-8b model two ways: SQ for INT8 (via run_generation_sq.py) and GPTQ for FP4 (via run_generation_cpu_woq.py). The directory listings below show that the FP4 output compresses less than the INT8 output.
Output as follows (saved_results is the FP4 directory):
(python310) [root@localhost]# ll llama_3_8b
total 15693104
-rw-r--r--. 1 root root 654 May 7 15:11 config.json
-rw-r--r--. 1 root root 126 May 7 15:11 generation_config.json
-rw-r--r--. 1 root root 4976698672 May 7 15:32 model-00001-of-00004.safetensors
-rw-r--r--. 1 root root 4999802720 May 7 15:32 model-00002-of-00004.safetensors
-rw-r--r--. 1 root root 4915916176 May 7 15:32 model-00003-of-00004.safetensors
-rw-r--r--. 1 root root 1168138808 May 7 15:15 model-00004-of-00004.safetensors
-rw-r--r--. 1 root root 23950 May 7 15:11 model.safetensors.index.json
-rw-r--r--. 1 root root 73 May 7 15:29 special_tokens_map.json
-rw-r--r--. 1 root root 50941 May 7 15:29 tokenizer_config.json
-rw-r--r--. 1 root root 9084490 May 7 15:29 tokenizer.json
(python310) [root@localhost]# ll llama_3_8b_sq/
total 9392932
-rw-r--r--. 1 root root 2101347573 May 7 15:19 best_model.1.pt
-rw-r--r--. 1 root root 2067952307 May 7 15:19 best_model.2.pt
-rw-r--r--. 1 root root 2083492751 May 7 15:19 best_model.3.pt
-rw-r--r--. 1 root root 2066833130 May 7 15:19 best_model.4.pt
-rw-r--r--. 1 root root 1298697078 May 7 15:16 best_model.5.pt
-rw-r--r--. 1 root root 265 May 7 15:15 checkpoints.json
-rw-r--r--. 1 root root 786 May 7 15:15 config.json
-rw-r--r--. 1 root root 17586 May 7 15:15 sq_replace_modules.json
(python310) [root@localhost]# ll saved_results/
total 11092580
-rw-r--r--. 1 root root 39665 May 30 23:47 all_checkpoint_keys.json
-rw-r--r--. 1 root root 994 May 30 23:47 config.json
-rw-r--r--. 1 root root 121 May 30 23:47 generation_config.json
-rw-r--r--. 1 root root 4945517000 May 30 23:47 model-00001-of-00003.safetensors
-rw-r--r--. 1 root root 4302654760 May 30 23:47 model-00002-of-00003.safetensors
-rw-r--r--. 1 root root 2101346432 May 30 23:47 model-00003-of-00003.safetensors
-rw-r--r--. 1 root root 78236 May 30 23:47 model.safetensors.index.json
-rw-r--r--. 1 root root 236 May 30 23:47 quantize_config.json
-rw-r--r--. 1 root root 301 May 30 23:47 special_tokens_map.json
-rw-r--r--. 1 root root 50941 May 30 23:47 tokenizer_config.json
-rw-r--r--. 1 root root 9084463 May 30 23:47 tokenizer.json
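Summing the model files from the listings above gives the sizes and ratios I mean (plain arithmetic on the numbers shown; the expectation that a 4-bit model should come out smaller than the INT8 one is my own assumption):

# File sizes in bytes, copied from the directory listings above.
original = 4976698672 + 4999802720 + 4915916176 + 1168138808               # ~16.06 GB
int8_sq  = 2101347573 + 2067952307 + 2083492751 + 2066833130 + 1298697078  # ~9.62 GB
fp4_gptq = 4945517000 + 4302654760 + 2101346432                            # ~11.35 GB

for name, size in [("original", original), ("int8 SQ", int8_sq), ("fp4 GPTQ", fp4_gptq)]:
    print(f"{name:9s} {size / 1e9:6.2f} GB  ({original / size:.2f}x vs original)")
# FP4 ends up larger than INT8 (~1.41x vs ~1.67x compression), the opposite
# of what I would expect for 4-bit weights.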
Will fix it in a PR soon.
@PenghuiCheng After the checks for fp4_e2m1_bnb and fp4_e2m1 are added, quantization completes and the quantized model can be saved. However, when the quantized model is loaded, the error RuntimeError: "normal_kernel_cpu" not implemented for 'Char' is raised.
The error is raised from transformers/models/llama/modeling_llama.py:803. The call stack is as follows (a minimal repro of the error is shown after the stack):
/intel_extension_for_transformers/transformers/modeling/modeling_auto.py:451 -> cls.load_low_bit
/intel_extension_for_transformers/transformers/modeling/modeling_auto.py:1423 -> model_class._load_pretrained_model
/transformers/modeling_utils.py:3772 -> _initialize_weights -> _init_weights (line 1613)
/transformers/models/llama/modeling_llama.py:803 -> module.weight.data.normal_
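The message itself is easy to reproduce in isolation: 'Char' is PyTorch's name for torch.int8, and normal_ has no integer kernel, so _init_weights calling normal_() on what I assume is a packed int8 weight tensor fails. A minimal standalone repro (mine, just to show where the message comes from):

import torch

# "Char" is the c10 scalar-type name for torch.int8; normal_ only has
# floating-point kernels, so this raises the same RuntimeError as above.
w = torch.zeros(4, 4, dtype=torch.int8)
try:
    w.normal_()
except RuntimeError as e:
    print(e)  # "normal_kernel_cpu" not implemented for 'Char'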
PR is: #1594
Please try it with PR #1594.