AutoGPTQ/AutoGPTQ

gptq 4bit avg loss is large

moseshu opened this issue · 3 comments

I used AutoGPTQ to convert a bfloat16 model to int4, and the avg loss is quite a bit larger than with int8.
The model is Mixtral-8x7B.
The int8 avg loss is almost 0.0004.

Does this mean the int4 quantization loss is more serious?

INT4 
2024-04-18 09:50:18 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.6.w3 in layer 29/32...
2024-04-18 09:50:19 INFO [auto_gptq.quantization.gptq] duration: 1.210331678390503
2024-04-18 09:50:19 INFO [auto_gptq.quantization.gptq] avg loss: 211.3699188232422
2024-04-18 09:50:19 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.7.w3 in layer 29/32...
2024-04-18 09:50:20 INFO [auto_gptq.quantization.gptq] duration: 1.2222824096679688
2024-04-18 09:50:20 INFO [auto_gptq.quantization.gptq] avg loss: 83.82467651367188
2024-04-18 09:50:38 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.0.w2 in layer 29/32...
2024-04-18 09:50:43 INFO [auto_gptq.quantization.gptq] duration: 4.365283012390137
2024-04-18 09:50:43 INFO [auto_gptq.quantization.gptq] avg loss: 19.91876983642578
2024-04-18 09:50:43 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.1.w2 in layer 29/32...
2024-04-18 09:50:47 INFO [auto_gptq.quantization.gptq] duration: 4.401262521743774
2024-04-18 09:50:47 INFO [auto_gptq.quantization.gptq] avg loss: 6.792891025543213
2024-04-18 09:50:47 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.2.w2 in layer 29/32...
2024-04-18 09:50:52 INFO [auto_gptq.quantization.gptq] duration: 4.381655931472778
2024-04-18 09:50:52 INFO [auto_gptq.quantization.gptq] avg loss: 36.049583435058594
2024-04-18 09:50:52 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.3.w2 in layer 29/32...
2024-04-18 09:50:56 INFO [auto_gptq.quantization.gptq] duration: 4.5015199184417725
2024-04-18 09:50:56 INFO [auto_gptq.quantization.gptq] avg loss: 13.600162506103516
2024-04-18 09:50:56 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.4.w2 in layer 29/32...
2024-04-18 09:51:00 INFO [auto_gptq.quantization.gptq] duration: 4.375776290893555
2024-04-18 09:51:00 INFO [auto_gptq.quantization.gptq] avg loss: 2.8602569103240967
2024-04-18 09:51:00 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.5.w2 in layer 29/32...
2024-04-18 09:51:05 INFO [auto_gptq.quantization.gptq] duration: 4.481191635131836
2024-04-18 09:51:05 INFO [auto_gptq.quantization.gptq] avg loss: 53.12783432006836
2024-04-18 09:51:05 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.6.w2 in layer 29/32...
2024-04-18 09:51:10 INFO [auto_gptq.quantization.gptq] duration: 4.568590402603149
2024-04-18 09:51:10 INFO [auto_gptq.quantization.gptq] avg loss: 73.41600036621094
2024-04-18 09:51:10 INFO [auto_gptq.modeling._base] Quantizing block_sparse_moe.experts.7.w2 in layer 29/32...
2024-04-18 09:51:14 INFO [auto_gptq.quantization.gptq] duration: 4.3780763149261475
2024-04-18 09:51:14 INFO [auto_gptq.quantization.gptq] avg loss: 27.395843505859375
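For reference, a 4-bit conversion with AutoGPTQ is typically driven by something like the sketch below. This is hypothetical: the model path, calibration text, and output directory are placeholders, not taken from this issue, and the int8 comparison presumably differs only in the `bits` value of `BaseQuantizeConfig`.

```python
# Hypothetical sketch of the 4-bit conversion; paths and calibration text are placeholders.
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"      # assumed source checkpoint
out_dir = "Mixtral-8x7B-GPTQ-4bit"            # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,            # 4 for the lossy run shown above; 8 for the int8 comparison
    group_size=128,
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config,
    torch_dtype=torch.bfloat16,   # load the bf16 source weights
)

# Calibration examples: tokenized text (dicts with input_ids / attention_mask).
examples = [tokenizer("Calibration text in the model's training domain goes here.")]

model.quantize(examples)
model.save_quantized(out_dir)
```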

Make sure you have a proper calibration dataset that is similar to the training dataset.
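For instance, pulling calibration samples from a corpus in the same domain as the original training data might look like the sketch below (hypothetical: the dataset, length filter, and sample count are placeholders, not something specified in this thread).

```python
# Hypothetical calibration-set construction; dataset choice and counts are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1", use_fast=True)

# Prefer long, natural text close to the original training distribution,
# not a handful of short synthetic strings.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in dataset["text"] if len(t.split()) > 64][:1024]

examples = [tokenizer(t, truncation=True, max_length=2048) for t in texts]
# Pass `examples` to model.quantize(examples).
```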

@moseshu Have you figured this problem out?

  1. In my experience, MoE models are much harder to quantize because of the gates/routers.

  2. You need a good calibration dataset, as close to the original training data as possible, with a high enough nsamples. My rule of thumb is 128 samples for every 7B parameters (a rough sizing sketch follows this list).

  3. Later layers have always had much higher losses than earlier layers.
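Applying the rule of thumb from point 2 to Mixtral-8x7B (roughly 46.7B total parameters) gives a ballpark calibration-set size. A trivial sketch of the arithmetic, assuming the rule scales with total parameter count:

```python
# Rough calibration-size heuristic: ~128 samples per 7B parameters (commenter's rule).
# Assumes the rule scales with total parameter count; adjust to taste.
def suggested_nsamples(total_params_billion: float, per_7b: int = 128) -> int:
    return int(round(total_params_billion / 7.0)) * per_7b

print(suggested_nsamples(7))      # 128 -> a single 7B model
print(suggested_nsamples(46.7))   # 896 -> Mixtral-8x7B (~46.7B total params)
```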

Based on what you posted, I can't tell whether the quant is good or not. Run a perplexity (PPL) evaluation after quantization, and also compare the running avg loss of early vs. later layers.
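A rough way to do that check is to compute perplexity on the same held-out text for the bf16 baseline, the int8 quant, and the int4 quant, and compare the drift. A minimal sketch (chunked PPL rather than a strict sliding window; paths are placeholders):

```python
# Rough chunked perplexity check for comparing quantized models.
# Paths are placeholders; run the same text through int4, int8, and the bf16 baseline.
import math
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "Mixtral-8x7B-GPTQ-4bit"                                   # placeholder
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")   # base tokenizer
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

def perplexity(model, tokenizer, text, chunk_len=2048, device="cuda:0"):
    """Average token-level NLL over non-overlapping chunks, exponentiated.
    Not a strict sliding-window PPL, but fine for relative comparisons."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, input_ids.size(1) - 1, chunk_len):
        ids = input_ids[:, start:start + chunk_len]
        if ids.size(1) < 2:
            break
        with torch.no_grad():
            # labels are shifted internally by the causal LM head;
            # if the wrapper doesn't forward `labels`, call model.model(...) instead.
            out = model(input_ids=ids, labels=ids)
        n = ids.size(1) - 1                  # predicted tokens in this chunk
        nll_sum += out.loss.float().item() * n
        n_tokens += n
    return math.exp(nll_sum / n_tokens)

# Example: ppl = perplexity(model, tokenizer, open("holdout.txt").read())
```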