failed to save quantized model
LiMa-cas opened this issue · 17 comments
save:
save_trans: True
save_lightllm: False
save_fake: False
save_path: /extra_data/mali36/llmc/models/
When I used the above config, I got a 16G model.
When I used the following config, I got a 29G model, but the AWQ model is 5G. Could you help me?
save_trans: False
save_lightllm: True
save_fake: False
save_path: /extra_data/mali36/llmc/models/
The model saved by save_trans should be the same size as the original model. We are still troubleshooting save_lightllm, so you can first try save_vllm: True and use the vLLM engine for inference. You can refer to this document: https://llmc-en.readthedocs.io/en/latest/backend/vllm.html
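For reference, the save section for that route looks roughly like the sketch below (the save_path is just the one from the config above; the rest of the config stays unchanged):

save:
    save_vllm: True
    save_path: /extra_data/mali36/llmc/models/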
Hi, I still get a 29G model:
base:
seed: &seed 42
model:
type: Llama
path: /home/test1/workspace/llama3-8B-instruct
tokenizer_mode: slow
torch_dtype: auto
calib:
name: pileval
download: True
n_samples: 128
bs: -1
seq_len: 512
preproc: pileval_awq
seed: *seed
eval:
eval_pos: [pretrain, transformed, fake_quant]
name: wikitext2
download: True
seq_len: 1024
# For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
# For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
bs: 1
inference_per_block: False
# Consistency of tokens between original and fake-quantized model output.
eval_token_consist: True
quant:
method: Awq
weight:
bit: 4
symmetric: True
granularity: per_group
group_size: 128
calib_algo: learnable
special:
trans: True
# The options for "trans_version" include "v1" and "v2".
# But their results don't differ significantly.
trans_version: v2
weight_clip: True
clip_version: v2
# For 2-bit quantization, setting "clip_sym: False" will yield better results.
clip_sym: True
quant_out: True
save:
save_vllm: True
save_path: /home/test1/workspace/llmc/models/
This is very strange. Can you show the config.json of the model you saved, as well as the command you used to measure the model size?
Or try this command: du -sh /home/test1/workspace/llmc/models/vllm_quant_model
config.json:
{
"_name_or_path": "/home/test1/workspace/mali/llama3-8B-instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.44.2",
"use_cache": false,
"vocab_size": 128256,
"compression_config": {
"config_groups": {
"group_0": {
"targets": [
"Linear"
],
"input_activations": null,
"weights": {
"dynamic": false,
"group_size": 128,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "group",
"symmetric": true,
"type": "int"
}
}
},
"format": "int-quantized",
"ignore": [
"lm_head"
],
"quant_method": "compressed-tensors"
}
}
We tried the latest version of llmc, and the save_vllm model we obtained is still 5.4G, with only two safetensors files. Why don't you update to the latest version and run it again? It is best to save the model to a new directory.
Could you show me your yml?
"need_pack" should be specified as True.
Hi, after I got the quantized model, how can I evaluate it?
To evaluate accuracy, you can use lm_eval:
lm_eval --model vllm \
    --model_args pretrained="/home/test1/workspace/llmc/models/vllm_quant_model",add_bos_token=true \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 250 \
    --batch_size 'auto'
thanks very much!!!
"need_pack" should be specified as True.
Where do I set need_pack?