Running FastChat on GPTQ (and quantized) models
Hi team,
I have a question about generating model responses with a GPTQ-quantized model.
I quantized Llama-2-7B-chat to 4 bits with GPTQ (AutoGPTQ backend), using optimum's GPTQQuantizer together with transformers:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the full-precision checkpoint in fp16 before quantizing.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# 4-bit GPTQ, calibrated on C4 with 2048-token sequences.
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized weights and the tokenizer to the same folder.
save_folder = "directory/"
quantizer.save(quantized_model, save_folder)
tokenizer.save_pretrained(save_folder)
Once the model is saved, I try to generate the answers with the following command:
python gen_model_answer.py --model-path directory/ --model-id llama-2-7b-gptq-4
But this throws the following error:
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/Compress_Align/FastChat/fastchat/llm_judge/gen_model_answer.py", line 103, in get_model_answers
model, tokenizer = load_model(
File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 379, in load_model
model, tokenizer = adapter.load_model(model_path, kwargs)
File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 124, in load_model
model = AutoModelForCausalLM.from_pretrained(
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
return model_class.from_pretrained(
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3823, in from_pretrained
hf_quantizer.postprocess_model(model)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/base.py", line 195, in postprocess_model
return self._process_model_after_weight_loading(model, **kwargs)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/quantizer_gptq.py", line 80, in _process_model_after_weight_loading
model = self.optimum_quantizer.post_init_model(model)
File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/optimum/gptq/quantizer.py", line 595, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
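The error message points at two ways out: put every module on the GPU, or turn the exllama backend off. A minimal sketch of both follows, assuming a recent transformers release where `GPTQConfig` accepts `use_exllama` (older releases use the `disable_exllama=True` spelling quoted in the message) and that accelerate is installed so `device_map` is honoured. The helper names `gptq_from_pretrained_kwargs` and `load_gptq` are hypothetical and not part of FastChat.

```python
def gptq_from_pretrained_kwargs(device="cuda:0", use_exllama=True):
    """Build extra kwargs for AutoModelForCausalLM.from_pretrained (hypothetical helper)."""
    # Placing the whole model on one GPU is what the exllama kernels require.
    kwargs = {"low_cpu_mem_usage": True, "device_map": device}
    if not use_exllama:
        # Imported lazily so transformers is only needed on this path.
        from transformers import GPTQConfig

        # Assumption: the installed transformers version supports `use_exllama`.
        kwargs["quantization_config"] = GPTQConfig(bits=4, use_exllama=False)
    return kwargs


def load_gptq(model_path, **options):
    """Load a saved GPTQ checkpoint with the kwargs built above (hypothetical helper)."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_path, **gptq_from_pretrained_kwargs(**options)
    )
```

Calling `load_gptq("directory/")` keeps exllama enabled and requests everything on `cuda:0`; `load_gptq("directory/", use_exllama=False)` is the fallback the error message suggests.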
I was able to pass a device map of 'cuda' in the load_model() function of the BaseModelAdapter class, but inference is far too slow. (I assume the model is loaded onto the GPU while the computation still runs on the CPU, which is not what I expected.)
model = AutoModelForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
trust_remote_code=True,
# device_map='cuda', # TODO: Change made to support quantized model instead of disabling exllama!
**from_pretrained_kwargs,
)
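One way to check the assumption about where the model actually lives is to count its tensors per device after loading. This is a small sketch, not FastChat code: `summarize_devices` is a hypothetical helper, and it relies on the model exposing the usual `named_parameters()` and `named_buffers()` methods of a PyTorch module. Buffers are included because GPTQ linear layers typically keep their packed weights there rather than in parameters.

```python
from collections import Counter
from itertools import chain


def summarize_devices(model):
    """Count parameters and buffers per device for a loaded model (hypothetical helper)."""
    tensors = chain(model.named_parameters(), model.named_buffers())
    # str() turns a torch.device into a label such as "cuda:0" or "cpu".
    return Counter(str(tensor.device) for _, tensor in tensors)
```

If `summarize_devices(model)` reports any `cpu` or `meta` entries, part of the model was not placed on the GPU, which would be consistent with both the exllama error and slow generation. If everything is on `cuda:0`, the slowdown more likely comes from the kernel backend in use.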
Is there a way to generate answers and evaluate quantized models in the same way as the standard flow described at https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md? Or am I missing something fundamental?
Tagging relevant issues
The main fix suggested there is to disable exllama, but that slows inference down a lot!
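To put a number on how much the exllama setting changes inference speed, a rough throughput measurement can be run once per configuration. The sketch below is only an illustration: `tokens_per_second` is a hypothetical helper, and `generate_fn` stands for any zero-argument callable that produces `num_new_tokens` tokens, for example a lambda wrapping `model.generate(...)` with a fixed `max_new_tokens`.

```python
import time


def tokens_per_second(generate_fn, num_new_tokens, warmup=1, repeats=3):
    """Rough generation throughput for a zero-argument callable (hypothetical helper)."""
    # Warm-up runs keep one-off costs such as kernel compilation out of the timing.
    for _ in range(warmup):
        generate_fn()
    start = time.perf_counter()
    for _ in range(repeats):
        # Assumption: generate_fn returns only after the GPU work has finished.
        generate_fn()
    elapsed = time.perf_counter() - start
    return repeats * num_new_tokens / elapsed
```

Running this with the same prompt and token budget for the exllama and non-exllama loads gives two directly comparable figures.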