intel/intel-extension-for-transformers

Cannot finish FP4 quantization: `RuntimeError: Qbits: only support Integer WOQ in PACKQ`

PhzCode opened this issue · 6 comments

Hey! I'm trying to quantize llama3-8b to FP4 by running the code in https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_cpu_woq.py

Here's my run command:
python run_generation_cpu_woq.py --model llama_3_8b --woq --woq_algo GPTQ --bits 4 --weight_dtype fp4_e2m1 --scale_dtype fp32 --compute_dtype fp32 --benchmark --accuracy --load_in_4bit

In run_generation_cpu_woq.py, the only FP4 choices for the weight_dtype argument are fp4_e2m1_bnb and fp4_e2m1. However, /intel_extension_for_transformers/transformers/llm/quantization/utils.py:269 checks only for fp4, so neither value matches; the branch falls through to the INT-type logic at line 281 and the run fails with RuntimeError: Qbits: only support Integer WOQ in PACKQ.
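
To make the mismatch concrete, here is a minimal sketch of the dispatch as I understand it (a hypothetical helper, not the actual utils.py code):

```python
# Sketch of the dispatch mismatch described above (hypothetical helper, not the
# real utils.py source): "fp4_e2m1" / "fp4_e2m1_bnb" never match a bare "fp4"
# check, so the weights fall through to the integer PACKQ path and qbits
# raises "Qbits: only support Integer WOQ in PACKQ".
def choose_pack_path(weight_dtype: str) -> str:
    if weight_dtype == "fp4":          # the check as I read utils.py:269
        return "float4 packing"
    return "integer packing"           # reached for fp4_e2m1 / fp4_e2m1_bnb

print(choose_pack_path("fp4_e2m1"))    # -> "integer packing" (the wrong branch)
```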

How do I run the program to do the FP4 quantization?

Related software packages and versions are as follows:

intel-extension-for-pytorch      2.3.0
intel-extension-for-transformers 1.4.1
neural_compressor                2.6.dev30+g7c0b700
neural-speed                     1.0

Here's the error:

2024-05-28 22:27:33 [INFO] Save deploy yaml to /home/nc_workspace/2024-05-28_14-12-59/deploy.yaml
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Traceback (most recent call last):
  File "/home/psi/run_generation_cpu_woq.py", line 299, in <module>
    user_model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 672, in from_pretrained
    model = convert_to_quantized_model(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 563, in convert_to_quantized_model
    q_model = replace_linear(inc_model, None, None, config, device=device)
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 119, in replace_linear
    model, is_replaced = _replace_linear(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 316, in _replace_linear
    _, is_replaced = _replace_linear(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 316, in _replace_linear
    _, is_replaced = _replace_linear(
  [Previous line repeated 1 more time]
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/utils.py", line 289, in _replace_linear
    model._modules[name].set_weights_bias(
  File "/root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/transformers/llm/quantization/nn/modules.py", line 226, in set_weights_bias
    packw = qbits.repack_quantized_weight(
RuntimeError: Qbits: only support Integer WOQ in PACKQ
Exception raised from bestla_packq at /tmp/pip-install-ue0q49gf/intel-extension-for-transformers_la94d9ed4b1947cca3046b78bd887212/intel_extension_for_transformers/qbits/dispatcher/src/bestla_packq_impl.cpp:185 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7faa71244167 in /root/miniconda3/envs/python310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7faa711f432e in /root/miniconda3/envs/python310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: woq::bestla_packq(woq::repack_quantized_weight_param*, woq::repack_quantized_weight_ctx*, woq::WOQ_TASK) + 0x22d (0x7fa9e875097d in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0xae20c (0x7fa9e871b20c in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0xaef34 (0x7fa9e871bf34 in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0xa7822 (0x7fa9e8714822 in /root/miniconda3/envs/python310/lib/python3.10/site-packages/intel_extension_for_transformers/qbits_py.cpython-310-x86_64-linux-gnu.so)
frame #6: python() [0x4fd907]
<omitting python frames>
frame #9: python() [0x5095ce]
frame #25: python() [0x5095ce]
frame #27: python() [0x5951c2]

In /intel_extension_for_transformers/transformers/llm/quantization/utils.py, I added fp4_e2m1_bnb and fp4_e2m1 to every place that checks for FP4. With that change quantization completes and the model is saved, but an error occurs when the quantized model is loaded for inference.
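
Roughly, the change is the following idea (a simplified sketch, not my exact diff; the helper name is made up):

```python
# Simplified sketch of the workaround: treat the e2m1 spellings as FP4
# wherever the code branches on the weight dtype. Not the exact patch I
# applied, just the idea behind it.
FP4_DTYPES = ("fp4", "fp4_e2m1", "fp4_e2m1_bnb")

def is_fp4(weight_dtype: str) -> bool:
    return weight_dtype in FP4_DTYPES
```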

The output is as follows:

2024-05-30 23:36:32 [INFO] Quantization done
2024-05-30 23:36:41 [INFO] GPTQ quantizing done.
2024-05-30 23:36:53 [INFO] |******Mixed Precision Statistics******|
2024-05-30 23:36:53 [INFO] +---------+-------+-----------+--------+
2024-05-30 23:36:53 [INFO] | Op Type | Total | A32W4G128 |  FP32  |
2024-05-30 23:36:53 [INFO] +---------+-------+-----------+--------+
2024-05-30 23:36:53 [INFO] |  Linear |  225  |    224    |   1    |
2024-05-30 23:36:53 [INFO] +---------+-------+-----------+--------+
2024-05-30 23:36:53 [INFO] Pass quantize model elapsed time: 28941231.53 ms
2024-05-30 23:36:53 [INFO] Save tuning history to /home/nc_workspace/2024-05-30_15-33-54/./history.snapshot.
2024-05-30 23:36:53 [INFO] [Strategy] Found the model meets accuracy requirements, ending the tuning process.
2024-05-30 23:36:53 [INFO] Specified timeout or max trials is reached! Found a quantized model which meet accuracy goal. Exit.
2024-05-30 23:36:53 [INFO] Save deploy yaml to /home/nc_workspace/2024-05-30_15-33-54/deploy.yaml
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
2024-05-30 23:37:06 [INFO] WeightOnlyQuant done.
2024-05-30 23:47:33 [INFO] Configuration saved in ./saved_results/quantize_config.json
2024-05-30 23:47:33 [INFO] quantization_config: {'bits': 4, 'compute_dtype': 'fp32', 'damp_percent': 0.01, 'desc_act': False, 'group_size': 128, 'llm_int8_skip_modules': [], 'quant_method': 'gptq', 'scale_dtype': 'fp32', 'sym': True, 'weight_dtype': 'fp4_e2m1'}
2024-05-30 23:47:33 [INFO] loading weights file ./saved_results/model.safetensors.index.json
/root/miniconda3/envs/python310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Loading model from:  ./saved_results
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
2024-05-30 23:47:33 [ERROR] Trying to set a tensor of shape torch.Size([3584, 4096]) in "qweight" (which has shape torch.Size([1792, 4096])), this look incorrect.
2024-05-30 23:47:33 [ERROR] Saved low bit model loading failed, please check your model.

I suspect the quantization itself is also wrong. I quantized the same llama3-8b model twice: to INT8 with SQ (run_generation_sq.py) and to FP4 with GPTQ (run_generation_cpu_woq.py). The directory listings below show that the FP4 output is larger than the INT8 output, i.e. the FP4 compression ratio is worse (see the quick size comparison after the listings).

The directory listings are as follows (saved_results is the FP4 output):

(python310) [root@localhost]# ll llama_3_8b
total 15693104
-rw-r--r--. 1 root root        654 May  7 15:11 config.json
-rw-r--r--. 1 root root        126 May  7 15:11 generation_config.json
-rw-r--r--. 1 root root 4976698672 May  7 15:32 model-00001-of-00004.safetensors
-rw-r--r--. 1 root root 4999802720 May  7 15:32 model-00002-of-00004.safetensors
-rw-r--r--. 1 root root 4915916176 May  7 15:32 model-00003-of-00004.safetensors
-rw-r--r--. 1 root root 1168138808 May  7 15:15 model-00004-of-00004.safetensors
-rw-r--r--. 1 root root      23950 May  7 15:11 model.safetensors.index.json
-rw-r--r--. 1 root root         73 May  7 15:29 special_tokens_map.json
-rw-r--r--. 1 root root      50941 May  7 15:29 tokenizer_config.json
-rw-r--r--. 1 root root    9084490 May  7 15:29 tokenizer.json
(python310) [root@localhost]# ll llama_3_8b_sq/
total 9392932
-rw-r--r--. 1 root root 2101347573 May  7 15:19 best_model.1.pt
-rw-r--r--. 1 root root 2067952307 May  7 15:19 best_model.2.pt
-rw-r--r--. 1 root root 2083492751 May  7 15:19 best_model.3.pt
-rw-r--r--. 1 root root 2066833130 May  7 15:19 best_model.4.pt
-rw-r--r--. 1 root root 1298697078 May  7 15:16 best_model.5.pt
-rw-r--r--. 1 root root        265 May  7 15:15 checkpoints.json
-rw-r--r--. 1 root root        786 May  7 15:15 config.json
-rw-r--r--. 1 root root      17586 May  7 15:15 sq_replace_modules.json
(python310) [root@localhost]# ll saved_results/
total 11092580
-rw-r--r--. 1 root root      39665 May 30 23:47 all_checkpoint_keys.json
-rw-r--r--. 1 root root        994 May 30 23:47 config.json
-rw-r--r--. 1 root root        121 May 30 23:47 generation_config.json
-rw-r--r--. 1 root root 4945517000 May 30 23:47 model-00001-of-00003.safetensors
-rw-r--r--. 1 root root 4302654760 May 30 23:47 model-00002-of-00003.safetensors
-rw-r--r--. 1 root root 2101346432 May 30 23:47 model-00003-of-00003.safetensors
-rw-r--r--. 1 root root      78236 May 30 23:47 model.safetensors.index.json
-rw-r--r--. 1 root root        236 May 30 23:47 quantize_config.json
-rw-r--r--. 1 root root        301 May 30 23:47 special_tokens_map.json
-rw-r--r--. 1 root root      50941 May 30 23:47 tokenizer_config.json
-rw-r--r--. 1 root root    9084463 May 30 23:47 tokenizer.json
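
For reference, a quick comparison of the on-disk totals from the listings above (the 1K-block totals reported by ll):

```python
# Totals (in 1K blocks) taken from the `ll` output above.
orig_kb    = 15_693_104   # llama_3_8b (original checkpoint)
sq_int8_kb =  9_392_932   # llama_3_8b_sq (SQ INT8)
woq_fp4_kb = 11_092_580   # saved_results (GPTQ FP4)

print(f"INT8 / original: {sq_int8_kb / orig_kb:.2f}")   # ~0.60
print(f"FP4  / original: {woq_fp4_kb / orig_kb:.2f}")   # ~0.71, i.e. larger than INT8
```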

Will fix it in a PR soon.

@PenghuiCheng After adding the fp4_e2m1_bnb and fp4_e2m1 checks, quantization completes and the quantized model is saved.
However, loading the quantized model fails with RuntimeError: "normal_kernel_cpu" not implemented for 'Char'.

The error is raised at transformers/models/llama/modeling_llama.py:803. The call stack is as follows:

/intel_extension_for_transformers/transformers/modeling/modeling_auto.py:451 -> cls.load_low_bit
/intel_extension_for_transformers/transformers/modeling/modeling_auto.py:1423 -> model_class._load_pretrained_model
/transformers/modeling_utils.py:3772 -> _initialize_weights -> _init_weights (line 1613)
/transformers/models/llama/modeling_llama.py:803 -> module.weight.data.normal_
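
A minimal snippet that reproduces the kernel error itself (my assumption about the cause: _init_weights calls .normal_() on the packed low-bit weights, which are stored as int8 and have no CPU normal_ kernel):

```python
import torch

# .normal_() has no CPU kernel for integer ("Char") tensors, which matches the
# error above; the packed low-bit weights appear to be stored as int8.
w = torch.zeros(4, 4, dtype=torch.int8)
try:
    w.normal_()
except RuntimeError as e:
    print(e)   # "normal_kernel_cpu" not implemented for 'Char'
```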

The PR is #1594.

model_class._load_pretrained_model

Please try it with PR #1594.