AutoGPTQ/AutoGPTQ

Question about data shape difference between quantization and forward


I ran AutoGPTQ on llama-7b. During model quantization, I see the following shapes for one layer:

scale:  [4096, 32]
zero:   [4096, 32]
g_idx: [4096]

From this I think GPTQ uses groups, quantizing every 128 columns as one group (4096 / 128 = 32 groups). Is that right?
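To check my understanding, here is a minimal sketch of group-wise quantization as I picture it. This is hypothetical PyTorch code with my own variable names, not AutoGPTQ's actual implementation. If each group of 128 input columns shares one scale and one zero point, that would give exactly the [4096, 32] shapes above for a 4096 x 4096 layer:

```python
import torch

# Hypothetical sketch (my own names, not AutoGPTQ's code): group-wise
# quantization where every 128 input columns share one scale/zero pair.
out_features, in_features, group_size = 4096, 4096, 128
num_groups = in_features // group_size            # 4096 / 128 = 32

W = torch.randn(out_features, in_features)
W_grouped = W.reshape(out_features, num_groups, group_size)

# One (scale, zero) per output row per column group -> [4096, 32].
w_max = W_grouped.amax(dim=-1)
w_min = W_grouped.amin(dim=-1)
scale = (w_max - w_min) / 15                      # 4-bit: 2**4 - 1 steps
zero = torch.round(-w_min / scale)

# g_idx would map each input column to its group id -> [4096].
g_idx = torch.arange(in_features) // group_size

print(scale.shape, zero.shape, g_idx.shape)
# torch.Size([4096, 32]) torch.Size([4096, 32]) torch.Size([4096])
```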

But when I run inference, I find the shapes have changed:

weight: [32, 128, 4096]  int8
zeros:  [32, 1, 4096]  int8
scales: [32, 1, 4096] fp16

Why is that? And why are the zeros and scales transposed?
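My guess, again just a sketch under my own assumptions that may well be wrong, is that the kernel stores the weight transposed as [in_features, out_features] and views it per group, so that the scales and zeros can broadcast over the 128 rows of each group during dequantization:

```python
import torch

# Hypothetical dequantization layout (my assumption, not AutoGPTQ's
# verified kernel code): weight kept transposed as [in, out] and viewed
# as [num_groups, group_size, out_features].
num_groups, group_size, out_features = 32, 128, 4096

qweight = torch.randint(0, 16, (num_groups, group_size, out_features),
                        dtype=torch.int8)
zeros = torch.randint(0, 16, (num_groups, 1, out_features),
                      dtype=torch.int8)
scales = torch.rand(num_groups, 1, out_features, dtype=torch.float16)

# zeros/scales broadcast along dim 1: each group of 128 rows shares one
# (zero, scale) pair per output column.
W_t = (qweight - zeros).half() * scales                   # [32, 128, 4096]
W_t = W_t.reshape(num_groups * group_size, out_features)  # [4096, 4096]

print(W_t.shape)  # this would be W transposed, ready for x @ W_t
```

If that is what happens, the transpose would just be a storage-layout choice for the kernel, but I would like confirmation.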

I'm very confused about this.

How is the group partition done? Does a group consist of multiple rows or multiple columns of the weight matrix?