AVX_VNNI Numeric Bug?
parvizmp opened this issue · 1 comment
$ git log
commit 9e20bd1072bb927613d55779b09752b05a348a9b (HEAD -> main, origin/main, origin/HEAD)
Author: luoyu-intel <yu.luo@intel.com>
Date: Fri Jan 5 10:44:12 2024 +0800
make UT OFF as default. (#25)
* make UT OFF as default.
* change pointer to const void*
Generate and quantize the weights:
$ python scripts/convert.py --outtype f32 --outfile llama2.ne-f32.bin ~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93
$ python scripts/quantize.py --model_name llama2 --model_file llama2.ne-f32.bin --out_file llama2.ne.weight-int4.group-128.compute-int8.bin --nthread 56 --weight_dtype int4 --group_size 128 --compute_dtype int8
If I run natively on SPR:
./build/bin/run_llama -s 0 --model_name llama -m llama2.ne.weight-int4.group-128.compute-int8.bin -c 512 -b 1024 -n 4 -t 1 -p "Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun."
...
Welcome to use the llama on the ITREX!
main: seed = 0
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from llama2.ne.weight-int4.group-128.compute-int8.bin
init: n_vocab = 32000
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ne ctx size = 3536.38 MB
load: mem required = 5586.38 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 276.00 MB
system_info: n_threads = 1 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 1024, n_predict = 4, n_keep = 0
Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. One day, she
However, I'm only interested in AVX2/AVX_VNNI.
So instead I patch the CPU feature getters to disable the AVX512/AMX paths and exercise the AVX2 + AVX_VNNI paths:
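// Local hack: force the AVX512/AMX feature getters to return false so that
// dispatch falls back to the AVX2 / AVX_VNNI kernels.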
inline bool AVX() { return mHasAVX; }
inline bool AVX2() { return mHasAVX2; }
inline bool AVX_VNNI() { return mHasAVX_VNNI; }
inline bool AVX512F() { return false && mHasAVX512F; }
inline bool AVX512_VNNI() { return false && mHasAVX512_VNNI; }
inline bool AMX_INT8() { return false && mHasAMX_INT8; }
inline bool AMX_BF16() { return false && mHasAMX_BF16; }
inline bool AVX512_BF16() { return false && mHasAVX512_BF16; }
inline bool AVX512_FP16() { return false && mHasAVX512_FP16; }
Now I see:
Welcome to use the llama on the ITREX!
main: seed = 0
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from llama2.ne.weight-int4.group-128.compute-int8.bin
init: n_vocab = 32000
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ne ctx size = 3536.38 MB
load: mem required = 5586.38 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 0
model_init_from_file: kv self size = 128.00 MB
system_info: n_threads = 1 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 1024, n_predict = 4, n_keep = 0
Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun.tembretembretembre
Note the bogus generated tokens "tembretembretembre" vs. "One day, she".
Am I missing something or is there a numeric issue with the AVX2/AVX_VNNI path?
FWIW I didn't see this behavior with a previous version: https://github.com/intel/intel-extension-for-transformers/tree/c087c74da00711fcac37014cc8aea443c4b5fa82/intel_extension_for_transformers/llm/runtime/graph
We could use SDE to quickly compare generated tokens for different ISAs; however, if I run under SDE with a spoofed CPU ID for Meteor Lake (i.e. sde -mtl ...) we see a segfault when it's attempting to determine the hybrid config:
(gdb) c
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x00007fffb25c85aa in bestla::device::CpuDevice::CpuDevice (this=0x7fffb293d940 <bestla::device::CpuDevice::getInstance()::instance>) at /home/parvizmp/neural-speed/bestla/bestla/bestla_device.h:306
306 E_L1Cache = L1[E_core[0]];
You could quantize an AVX_VNNI model with your patched code: a model quantized with the AMX path is different from one quantized with the AVX_VNNI path.
MTL has two different CPU micro-architectures, P-cores and E-cores, and the number of each depends on the SKU. There is a bug when only the P-cores of a hybrid CPU are used, e.g. by setting CPU affinity outside the program with numactl. SDE seems to only emulate the P-cores of MTL, which triggers the bug. We will fix the bug soon.
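For context, a minimal standalone sketch of the failure mode described above, assuming E_core and L1 behave like plain integer vectors of detected E-core IDs and per-core L1 sizes (an assumption based only on the backtrace; this is neither the upstream code nor the actual fix):

#include <cstdio>
#include <vector>

int main() {
  // Illustrative per-core L1 data-cache sizes; the real values come from CPUID.
  std::vector<int> L1 = {49152, 49152};
  // Empty when no E-cores are detected, e.g. under SDE's MTL emulation,
  // which appears to expose only P-cores.
  std::vector<int> E_core;
  // Unguarded access, as at bestla_device.h:306, indexes an empty vector:
  //   int E_L1Cache = L1[E_core[0]];   // undefined behavior, can segfault
  // Hypothetical guarded variant:
  int E_L1Cache = E_core.empty() ? 0 : L1[E_core[0]];
  std::printf("E_L1Cache = %d\n", E_L1Cache);
  return 0;
}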