Question about data shape difference between quantization and forward
Opened this issue · 0 comments
sleepwalker2017 commented
I run auto gptq using llama-7b, when doing model quantization, I see the shape of a layer as follows:
scale: [4096, 32]
zero: [4096, 32]
g_idx: [4096]
From this I infer that GPTQ uses groups and quantizes 128 columns as one group. Is that right?
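To make my assumption concrete, here is a small numpy sketch of how I think those quantization-time shapes arise. Everything here is my guess, not AutoGPTQ's actual code: I assume group_size=128, that groups run along the input (column) dimension of a [4096, 4096] weight, and a simple max-abs scale per (output channel, group) just for illustration.

```python
import numpy as np

out_features, in_features, group_size = 4096, 4096, 128
n_groups = in_features // group_size  # 32

W = np.random.randn(out_features, in_features).astype(np.float32)

# Hypothetical: one scale per (output channel, group),
# max-abs over each 128-column slice, mapped to 4-bit range.
scales = np.stack(
    [np.abs(W[:, g * group_size:(g + 1) * group_size]).max(axis=1) / 7.0
     for g in range(n_groups)],
    axis=1,
)

# g_idx maps each input column to its group index.
g_idx = np.arange(in_features) // group_size

print(scales.shape, g_idx.shape)  # (4096, 32) (4096,)
```

Under this assumption the shapes match what I see: scale [4096, 32] and g_idx [4096].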
But when I run inference, I find the shapes have changed:
weight: [32, 128, 4096] int8
zeros: [32, 1, 4096] int8
scales: [32, 1, 4096] fp16
Why is that? Why are the zeros and scales transposed? I'm quite confused.
How are the groups partitioned: multiple rows per group, or multiple columns?
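My current guess about the transposed inference layout, again as a speculative numpy sketch: the kernel may want the group axis leading and a length-1 axis inserted so the scales broadcast directly over the 128 rows of each group's weight slice. None of this is confirmed by the AutoGPTQ source; the shapes just happen to line up.

```python
import numpy as np

out_features, n_groups, group_size = 4096, 32, 128

# Quantization-time layout: one scale per (output channel, group).
scales_quant = np.random.rand(out_features, n_groups).astype(np.float16)

# Guessed inference-time layout: transpose, then add a broadcast axis.
scales_infer = scales_quant.T[:, None, :]  # (32, 1, 4096)

# Dequantization would then broadcast cleanly against the
# [32, 128, 4096] int weight without any reshaping per group.
weight_int = np.ones((n_groups, group_size, out_features), dtype=np.int8)
dequant = weight_int.astype(np.float16) * scales_infer

print(scales_infer.shape, dequant.shape)  # (32, 1, 4096) (32, 128, 4096)
```

If that is the intent, the "transpose" would just be a storage convenience for the dequantization kernel, but I'd appreciate confirmation.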