failed to save quantized model
LiMa-cas opened this issue · 17 comments
save:
save_trans: True
save_lightllm: False
save_fake: False
save_path: /extra_data/mali36/llmc/models/
When I used the above config, I got a 16G model.
When I used the following config, I got a 29G model, but the AWQ model is 5G. Could you help me?
save_trans: False
save_lightllm: True
save_fake: False
save_path: /extra_data/mali36/llmc/models/
The model saved by save_trans should be the same size as the original model. We are still troubleshooting save_lightllm, so you can first try save_vllm: True and use the vLLM engine for inference. You can refer to this document: https://llmc-en.readthedocs.io/en/latest/backend/vllm.html
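For reference, the save section for that route looks roughly like the sketch below (the save_path is just the one from the config above; the rest of the config stays unchanged):

save:
    save_vllm: True
    save_path: /extra_data/mali36/llmc/models/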
Hi, I still get a 29G model:
base:
seed: &seed 42
model:
type: Llama
path: /home/test1/workspace/llama3-8B-instruct
tokenizer_mode: slow
torch_dtype: auto
calib:
name: pileval
download: True
n_samples: 128
bs: -1
seq_len: 512
preproc: pileval_awq
seed: *seed
eval:
eval_pos: [pretrain, transformed, fake_quant]
name: wikitext2
download: True
seq_len: 1024
# For 7B / 13B model eval, bs can be set to "1", and inference_per_block can be set to "False".
# For 70B model eval, bs can be set to "20", and inference_per_block can be set to "True".
bs: 1
inference_per_block: False
# Consistency of tokens between original and fake-quantized model output.
eval_token_consist: True
quant:
method: Awq
weight:
bit: 4
symmetric: True
granularity: per_group
group_size: 128
calib_algo: learnable
special:
trans: True
# The options for "trans_version" include "v1" and "v2".
# But their results don't differ significantly.
trans_version: v2
weight_clip: True
clip_version: v2
# For 2-bit quantization, setting "clip_sym: False" will yield better results.
clip_sym: True
quant_out: True
save:
save_vllm: True
save_path: /home/test1/workspace/llmc/models/
This is very strange. Can you show the config.json of the model you saved, as well as the command you used to measure the model size?
Or try this command: du -sh /home/test1/workspace/llmc/models/vllm_quant_model
config.json:
{
"_name_or_path": "/home/test1/workspace/mali/llama3-8B-instruct",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128001,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.44.2",
"use_cache": false,
"vocab_size": 128256,
"compression_config": {
"config_groups": {
"group_0": {
"targets": [
"Linear"
],
"input_activations": null,
"weights": {
"dynamic": false,
"group_size": 128,
"num_bits": 4,
"observer": "minmax",
"observer_kwargs": {},
"strategy": "group",
"symmetric": true,
"type": "int"
}
}
},
"format": "int-quantized",
"ignore": [
"lm_head"
],
"quant_method": "compressed-tensors"
}
}
We tried the latest version of llmc, and the save_vllm model we obtained is still 5.4G, with only two safetensors files. Why don't you update to the latest version and run it again? It is best to save the model to a new directory.
Could you show me your yml?
"need_pack" should be specified as True.
Hi, after I got the quantized model, how can I evaluate it?
To evaluate accuracy, you can use lm_eval:
lm_eval --model vllm \
    --model_args pretrained="/home/test1/workspace/llmc/models/vllm_quant_model",add_bos_token=true \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 250 \
    --batch_size 'auto'
thanks very much!!!
"need_pack" should be specified as True.
Where do I set need_pack?