
Running FastChat on GPTQ (and quantized) models


Hi team,

I have a question about generating model answers with a GPTQ-quantized model.

I've compressed Llama-2-7B-chat with a basic GPTQ setup through the optimum/transformers integration:

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# 4-bit GPTQ quantization, calibrated on the C4 dataset
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized weights and the tokenizer
save_folder = "directory/"
quantizer.save(model, save_folder)
tokenizer.save_pretrained(save_folder)
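
As a quick sanity check outside FastChat, here is a minimal sketch of how I would expect the saved checkpoint to load back (assuming the config.json written by `quantizer.save` carries the GPTQ quantization config, which the traceback below suggests it does; `device_map="auto"` is meant to keep every module on GPU, as the exllama backend requires):

from transformers import AutoModelForCausalLM, AutoTokenizer

save_folder = "directory/"  # same folder the quantizer wrote to

# Reload the GPTQ checkpoint; device_map="auto" dispatches all modules to GPU,
# which is what the exllama/exllamav2 kernels expect.
tokenizer = AutoTokenizer.from_pretrained(save_folder)
model = AutoModelForCausalLM.from_pretrained(save_folder, device_map="auto")

# Short generation to confirm the quantized model itself works on GPU.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))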

Once the model is saved, I try to generate the answers using the following command:

python gen_model_answer.py --model-path directory/ --model-id llama-2-7b-gptq-4

But this throws the following error -

File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/Compress_Align/FastChat/fastchat/llm_judge/gen_model_answer.py", line 103, in get_model_answers
    model, tokenizer = load_model(
  File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 379, in load_model
    model, tokenizer = adapter.load_model(model_path, kwargs)
  File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 124, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3823, in from_pretrained
    hf_quantizer.postprocess_model(model)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/base.py", line 195, in postprocess_model
    return self._process_model_after_weight_loading(model, **kwargs)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/quantizer_gptq.py", line 80, in _process_model_after_weight_loading
    model = self.optimum_quantizer.post_init_model(model)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/optimum/gptq/quantizer.py", line 595, in post_init_model
    raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object

I was able to pass `device_map="cuda"` in the load_model() function of the BaseModelAdapter class, but inference is extremely slow. (I assume the weights are loaded onto the GPU while some of the computation still happens on the CPU, which is not what I expected.)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="cuda",  # change made to support the quantized model instead of disabling exllama
    **from_pretrained_kwargs,
)
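
For reference, this is the check I would run to test that assumption (a minimal sketch, assuming `model` is the object returned by load_model()): it prints the dispatch map and counts tensors per device, so anything left on CPU or disk shows up immediately.

import collections

# hf_device_map is set by accelerate when a device_map is used at load time.
print(getattr(model, "hf_device_map", "no hf_device_map set"))

# Count tensors per device. GPTQ stores packed weights (qweight, qzeros, scales)
# as buffers, so count both parameters and buffers; with the exllama backend
# everything should report cuda.
devices = collections.Counter(
    str(t.device) for t in list(model.parameters()) + list(model.buffers())
)
print(devices)  # e.g. Counter({'cuda:0': N}) if nothing spilled to CPU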

Is there a way to generate answers and evaluate quantized models the same way as described in https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md? Or am I missing something fundamental?

Tagging relevant issues

  1. #2459
  2. AutoGPTQ/AutoGPTQ#406

The main suggested fix is to disable exllama, but that slows down inference a lot!
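
For completeness, this is roughly what that workaround looks like when loading through transformers (a sketch only; the exact flag depends on the transformers version, with older releases using `disable_exllama=True` and newer ones `use_exllama=False`), at the cost of falling back to the slower kernels:

from transformers import AutoModelForCausalLM, GPTQConfig

# Disable the exllama kernels so modules offloaded to CPU/disk are tolerated,
# trading away the faster GPU generation path.
quantization_config = GPTQConfig(bits=4, use_exllama=False)  # or disable_exllama=True on older transformers
model = AutoModelForCausalLM.from_pretrained(
    "directory/",
    device_map="auto",
    quantization_config=quantization_config,
)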