vllm-project/vllm

vllm gptq or awq

Closed this issue · 10 comments

I need to run either an AWQ or GPTQ version of a fine-tuned llama-7b model. I am struggling to do so.

My models:
Fine-tuned llama-7b GPTQ model: rshrott/description-together-ai-4bit
Fine-tuned llama-7b AWQ model: rshrott/description-awq-4bit

What I tried:
Has anyone gotten this branch to work? I'm struggling.
Branch: https://github.com/chu-tianxiang/vllm-gptq

pip install git+https://github.com/chu-tianxiang/vllm-gptq.git
from vllm import LLM, SamplingParams

llm = LLM(model="rshrott/description-together-ai-4bit")

Error:
INFO 09-15 13:17:17 llm_engine.py:70] Initializing an LLM engine with config: model='rshrott/description-together-ai-4bit', tokenizer='rshrott/description-together-ai-4bit', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)

KeyError Traceback (most recent call last)
in <cell line: 3>()
1 #llm = LLM(model="TheBloke/Llama-2-7b-Chat-GPTQ")
2
----> 3 llm = LLM(model="rshrott/description-together-ai-4bit")

7 frames
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py in load_weights(self, model_name_or_path, cache_dir, use_np_cache)
310 if weight_name not in name:
311 continue
--> 312 param = state_dict[name.replace(weight_name, "qkv_proj")]
313 if "g_idx" in name:
314 param.data.copy_(loaded_weight)

KeyError: 'model.layers.0.self_attn.qkv_proj.qweight'

Does anyone have some working example in a google colab notebook to follow?

Hi @ryanshrott, thanks for trying out vLLM. We now support AWQ. GPTQ support is in progress.

  1. Install vLLM from source by running:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
  2. Add quantization="awq" when initializing your AWQ model. For example,
model = LLM("casperhansen/vicuna-7b-v1.5-awq", quantization="awq")

This should just work.

Please note that we assume 1) the model is already quantized, 2) the model directory contains a config file (e.g., quantize_config.json) for the quantization parameters, and 3) AWQ is only supported for Ampere and newer GPUs.
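For completeness, here is a minimal end-to-end sketch of the AWQ path above (the prompt and sampling settings are illustrative only; the model name is the one from the example):

from vllm import LLM, SamplingParams

# quantization="awq" tells vLLM to load the pre-quantized AWQ weights.
llm = LLM(model="casperhansen/vicuna-7b-v1.5-awq", quantization="awq")

# Illustrative sampling settings; adjust as needed.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate completions for a batch of prompts.
outputs = llm.generate(["Describe this listing:"], sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)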

Exciting. I will try it out tomorrow!

@WoosukKwon I tried to run the following:
pip install git+https://github.com/vllm-project/vllm.git

But get this error in google colab:
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=2.0.0->vllm==0.1.7) (1.3.0)
Building wheels for collected packages: vllm
error: subprocess-exited-with-error

× Building wheel for vllm (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Building wheel for vllm (pyproject.toml) ... error
ERROR: Failed building wheel for vllm
Failed to build vllm
ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects

@WoosukKwon I got it working, but I'm finding the quantized models run extremely slowly, like > 1000 times slower. Any idea?

I also got an out-of-memory error, which I don't get if I use HF transformers.

vllm-gptq is an unofficial branch developed by me. Please run pip install git+https://github.com/chu-tianxiang/vllm-gptq@gptq_hf to install. I tested it with rshrott/description-together-ai-4bit with no problem.
Still, it's preferable to use AWQ for now, until GPTQ is officially integrated into vLLM.

Edit: this branch also requires optimum and the latest AutoGPTQ to be installed: pip install optimum git+https://github.com/PanQiWei/AutoGPTQ
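For reference, a minimal sketch of loading the model with this branch (the prompt and sampling settings below are illustrative, and it is assumed the branch keeps the same LLM/SamplingParams API as upstream vLLM):

from vllm import LLM, SamplingParams

# Assumes the gptq_hf branch picks up the GPTQ settings (bits, group size)
# from the quantize_config.json shipped in the model repo.
llm = LLM(model="rshrott/description-together-ai-4bit")

# Illustrative sampling settings.
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Describe the property:"], sampling_params)
print(outputs[0].outputs[0].text)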

@chu-tianxiang I'll give it a shot. The AWQ integration seems to make my model way slower, and I'm not sure why.

@chu-tianxiang I got it working, but there also seems to be a bit of a performance issue. Iterations are about 20 times slower than with the non-quantized model, although I was using an RTX 3090 before and am now on an RTX 3080. I am also running on a WSL machine here, but I'm still surprised by this performance loss. I also see this warning on WSL:
WARNING 09-18 10:49:45 cache_engine.py:96] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.

Any idea why my performance is so much lower? I'm used to getting 2 sec/it. Now I get 15 secs/it!

After more benchmarking, I get 18 secs/iteration with GPTQ. I was previously getting 1.5 sec/iteration without GPTQ.

Hi @ryanshrott, I'm trying to explore running vLLM with a GPTQ model, and this result looks exciting.