intel/neural-speed

AVX_VNNI Numeric Bug?

parvizmp opened this issue · 1 comment

$ git log
commit 9e20bd1072bb927613d55779b09752b05a348a9b (HEAD -> main, origin/main, origin/HEAD)
Author: luoyu-intel <yu.luo@intel.com>
Date:   Fri Jan 5 10:44:12 2024 +0800

    make UT OFF as default. (#25)

    * make UT OFF as default.

    * change pointer to const void*

Generate weights:

$ python scripts/convert.py --outtype f32 --outfile llama2.ne-f32.bin ~/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93
$ python scripts/quantize.py --model_name llama2 --model_file llama2.ne-f32.bin --out_file llama2.ne.weight-int4.group-128.compute-int8.bin --nthread 56 --weight_dtype int4 --group_size 128 --compute_dtype int8

If I run natively on SPR (Sapphire Rapids):

./build/bin/run_llama -s 0 --model_name llama -m llama2.ne.weight-int4.group-128.compute-int8.bin -c 512 -b 1024 -n 4 -t 1 -p "Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun."
...
Welcome to use the llama on the ITREX!
main: seed  = 0
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from llama2.ne.weight-int4.group-128.compute-int8.bin
init: n_vocab    = 32000
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 32
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ne ctx size = 3536.38 MB
load: mem required  = 5586.38 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size =  276.00 MB

system_info: n_threads = 1 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 1024, n_predict = 4, n_keep = 0


 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. One day, she

However, I'm only interested in AVX2/AVX_VNNI, so I disable the AVX512 and AMX feature queries (in bestla/bestla/bestla_device.h) to force the AVX2 + AVX_VNNI paths:

inline bool AVX() { return mHasAVX; }
inline bool AVX2() { return mHasAVX2; }
inline bool AVX_VNNI() { return mHasAVX_VNNI; }
inline bool AVX512F() { return false && mHasAVX512F; }
inline bool AVX512_VNNI() { return false && mHasAVX512_VNNI; }
inline bool AMX_INT8() { return false && mHasAMX_INT8; }
inline bool AMX_BF16() { return false && mHasAMX_BF16; }
inline bool AVX512_BF16() { return false && mHasAVX512_BF16; }
inline bool AVX512_FP16() { return false && mHasAVX512_FP16; }

Now I see:

Welcome to use the llama on the ITREX!
main: seed  = 0
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
model.cpp: loading model from llama2.ne.weight-int4.group-128.compute-int8.bin
init: n_vocab    = 32000
init: n_embd     = 4096
init: n_mult     = 256
init: n_head     = 32
init: n_head_kv  = 32
init: n_layer    = 32
init: n_rot      = 128
init: n_ff       = 11008
init: n_parts    = 1
load: ne ctx size = 3536.38 MB
load: mem required  = 5586.38 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 0
model_init_from_file: kv self size =  128.00 MB

system_info: n_threads = 1 / 56 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 1024, n_predict = 4, n_keep = 0


 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun.tembretembretembre

Note the bogus generated tokens "tembretembretembre" vs. "One day, she" from the native run.

Am I missing something, or is there a numeric issue in the AVX2/AVX_VNNI path?

FWIW I didn't see this behavior with a previous version: https://github.com/intel/intel-extension-for-transformers/tree/c087c74da00711fcac37014cc8aea443c4b5fa82/intel_extension_for_transformers/llm/runtime/graph

We could use SDE to quickly compare generated tokens across different ISAs; however, if I run under SDE with a spoofed CPUID for Meteor Lake (i.e. sde -mtl ...), we get a segfault when it's attempting to determine the hybrid config:

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007fffb25c85aa in bestla::device::CpuDevice::CpuDevice (this=0x7fffb293d940 <bestla::device::CpuDevice::getInstance()::instance>) at /home/parvizmp/neural-speed/bestla/bestla/bestla_device.h:306
306             E_L1Cache = L1[E_core[0]];
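
For reference, the comparison workflow I had in mind is roughly the following (assuming SDE's -spr chip option for the native target, analogous to the -mtl option above):

$ PROMPT="Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun."
$ sde -spr -- ./build/bin/run_llama -s 0 --model_name llama -m llama2.ne.weight-int4.group-128.compute-int8.bin -c 512 -b 1024 -n 4 -t 1 -p "$PROMPT" > tokens.spr.txt
$ sde -mtl -- ./build/bin/run_llama -s 0 --model_name llama -m llama2.ne.weight-int4.group-128.compute-int8.bin -c 512 -b 1024 -n 4 -t 1 -p "$PROMPT" > tokens.mtl.txt
$ diff tokens.spr.txt tokens.mtl.txt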

You may want to quantize an AVX-VNNI model with your patched code: a model quantized with AMX enabled is different from one quantized for AVX-VNNI.
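
That is, with the AVX512/AMX checks disabled, re-run the quantization step (same command as above) so the weights are packed for the AVX-VNNI kernels:

$ python scripts/quantize.py --model_name llama2 --model_file llama2.ne-f32.bin --out_file llama2.ne.weight-int4.group-128.compute-int8.bin --nthread 56 --weight_dtype int4 --group_size 128 --compute_dtype int8
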
MTL has two different CPU micro-architectures, P-cores and E-cores; the number of each depends on the SKU. There is a bug when only the P-cores of a hybrid CPU are used, e.g. by restricting CPU affinity outside the program with numactl. SDE seems to emulate only the P-cores of MTL, which triggers the bug. We will fix the bug soon.
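
For reference, the crash site suggests E_core is empty in that configuration; a guard along these lines would avoid the segfault (a minimal sketch with hypothetical names and core detection stubbed out, not the actual bestla_device.h code):

#include <cstdio>
#include <vector>

// Sketch of the failure mode and a defensive guard; the real CpuDevice
// in bestla_device.h detects core types and cache sizes via CPUID.
struct CpuDeviceSketch {
  std::vector<int> P_core, E_core;  // indices of detected P-/E-cores
  std::vector<int> L1;              // per-core L1 data-cache size in bytes
  int P_L1Cache = 0, E_L1Cache = 0;

  void initCacheSizes() {
    if (!P_core.empty()) P_L1Cache = L1[P_core[0]];
    // Guard: when only P-cores are visible (SDE emulation, or affinity
    // restricted to P-cores), E_core is empty and E_core[0] is out of
    // bounds -- the SIGSEGV seen at bestla_device.h:306.
    if (!E_core.empty()) E_L1Cache = L1[E_core[0]];
  }
};

int main() {
  CpuDeviceSketch dev;
  dev.P_core = {0};       // e.g. SDE -mtl exposing a single P-core
  dev.L1 = {48 * 1024};   // hypothetical 48 KiB L1d for that core
  dev.initCacheSizes();   // E_core is empty, so E_L1Cache stays 0; no crash
  std::printf("P_L1Cache=%d E_L1Cache=%d\n", dev.P_L1Cache, dev.E_L1Cache);
  return 0;
}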