vllm-project/vllm

vllm gptq or awq

Closed this issue · 10 comments

I need to run either an AWQ or GPTQ version of a fine-tuned llama-7b model. I am struggling to do so.

My models:
Fine-tuned llama-7b GPTQ model: rshrott/description-together-ai-4bit
Fine-tuned llama-7b AWQ model: rshrott/description-awq-4bit

What I tried:
Has anyone gotten this branch to work? I'm struggling.
Branch: https://github.com/chu-tianxiang/vllm-gptq

pip install git+https://github.com/chu-tianxiang/vllm-gptq.git
from vllm import LLM, SamplingParams

llm = LLM(model="rshrott/description-together-ai-4bit")

Error:
INFO 09-15 13:17:17 llm_engine.py:70] Initializing an LLM engine with config: model='rshrott/description-together-ai-4bit', tokenizer='rshrott/description-together-ai-4bit', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)

KeyError Traceback (most recent call last)
in <cell line: 3>()
1 #llm = LLM(model="TheBloke/Llama-2-7b-Chat-GPTQ")
2
----> 3 llm = LLM(model="rshrott/description-together-ai-4bit")

7 frames
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py in load_weights(self, model_name_or_path, cache_dir, use_np_cache)
310 if weight_name not in name:
311 continue
--> 312 param = state_dict[name.replace(weight_name, "qkv_proj")]
313 if "g_idx" in name:
314 param.data.copy_(loaded_weight)

KeyError: 'model.layers.0.self_attn.qkv_proj.qweight'

Does anyone have some working example in a google colab notebook to follow?

Hi @ryanshrott, thanks for trying out vLLM. We now support AWQ. GPTQ support is in progress.

  1. Install vLLM from source by running:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
  2. Add quantization="awq" when initializing your AWQ model. For example,
model = LLM("casperhansen/vicuna-7b-v1.5-awq", quantization="awq")

This should just work.

Please note that we assume 1) the model is already quantized, 2) the model directory contains a config file (e.g., quantize_config.json) for the quantization parameters, and 3) AWQ is only supported for Ampere and newer GPUs.
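For completeness, here is a minimal end-to-end sketch of the AWQ path above (the prompt and sampling settings are illustrative only; the model name is the one from the example):

from vllm import LLM, SamplingParams

# quantization="awq" tells vLLM to load the pre-quantized AWQ weights.
llm = LLM(model="casperhansen/vicuna-7b-v1.5-awq", quantization="awq")

# Illustrative sampling settings; adjust as needed.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate completions for a batch of prompts.
outputs = llm.generate(["Describe this listing:"], sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)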

Exciting. I will try it out tomorrow!

@WoosukKwon I tried to run the following:
pip install git+https://github.com/vllm-project/vllm.git

But get this error in google colab:
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=2.0.0->vllm==0.1.7) (1.3.0)
Building wheels for collected packages: vllm
error: subprocess-exited-with-error

× Building wheel for vllm (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Building wheel for vllm (pyproject.toml) ... error
ERROR: Failed building wheel for vllm
Failed to build vllm
ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects

@WoosukKwon I got it working, but I'm finding the quantized models run extremely slowly, like > 1000 times slower. Any idea?

I also got an out-of-memory error, which I don't get if I use HF transformers.

vllm-gptq is an unofficial branch developed by me. Please run pip install git+https://github.com/chu-tianxiang/vllm-gptq@gptq_hf to install. I tested it with rshrott/description-together-ai-4bit with no problem.
Still, it's preferable to use AWQ for now, until GPTQ is officially integrated into vLLM.

Edit: this branch also requires optimum and the latest AutoGPTQ to be installed: pip install optimum git+https://github.com/PanQiWei/AutoGPTQ
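For reference, a minimal sketch of loading the model with this branch (the prompt and sampling settings below are illustrative, and it is assumed the branch keeps the same LLM/SamplingParams API as upstream vLLM):

from vllm import LLM, SamplingParams

# Assumes the gptq_hf branch picks up the GPTQ settings (bits, group size)
# from the quantize_config.json shipped in the model repo.
llm = LLM(model="rshrott/description-together-ai-4bit")

# Illustrative sampling settings.
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Describe the property:"], sampling_params)
print(outputs[0].outputs[0].text)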

@chu-tianxiang I'll give it a shot. The AWQ integration seems to make my model way slower, and I'm not sure why.

@chu-tianxiang I got it working, but there also seems to be a bit of a performance issue. Iterations are about 20 times slower than with the non-quantized model, although I was using an RTX 3090 before and am now on an RTX 3080. I am also running on a WSL machine here, but I'm still surprised by this performance loss. I also see this warning on WSL:
WARNING 09-18 10:49:45 cache_engine.py:96] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.

Any idea why my performance is so much lower? I'm used to getting 2 sec/it. Now I get 15 secs/it!

After more benchmarking, I get 18 secs/iteration with GPTQ. I was previously getting 1.5 sec/iteration without GPTQ.

Hi @ryanshrott, I'm trying to explore running vLLM with a GPTQ model, and this result looks exciting.