vllm gptq or awq
Closed this issue · 10 comments
I need to run either an AWQ or GPTQ version of a fine-tuned llama-7b model, and I am struggling to do so.
My models:
Fine-tuned llama-7b GPTQ model: rshrott/description-together-ai-4bit
Fine-tuned llama-7b AWQ model: rshrott/description-awq-4bit
What I tried:
Has anyone got this branch to work? Struggling.
branch: https://github.com/chu-tianxiang/vllm-gptq
pip install git+https://github.com/chu-tianxiang/vllm-gptq.git
from vllm import LLM, SamplingParams
llm = LLM(model="rshrott/description-together-ai-4bit")
Error:
INFO 09-15 13:17:17 llm_engine.py:70] Initializing an LLM engine with config: model='rshrott/description-together-ai-4bit', tokenizer='rshrott/description-together-ai-4bit', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
KeyError Traceback (most recent call last)
in <cell line: 3>()
1 #llm = LLM(model="TheBloke/Llama-2-7b-Chat-GPTQ")
2
----> 3 llm = LLM(model="rshrott/description-together-ai-4bit")
7 frames
/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py in load_weights(self, model_name_or_path, cache_dir, use_np_cache)
310 if weight_name not in name:
311 continue
--> 312 param = state_dict[name.replace(weight_name, "qkv_proj")]
313 if "g_idx" in name:
314 param.data.copy_(loaded_weight)
KeyError: 'model.layers.0.self_attn.qkv_proj.qweight'
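The traceback hints at the cause: vLLM's llama loader fuses the separate q_proj/k_proj/v_proj checkpoint tensors into a single qkv_proj parameter, so it looks up every quantized tensor (qweight, qzeros, scales, etc.) under the fused name. A minimal sketch of that renaming, simplified from the load_weights logic shown in the traceback (the weight-name list here is an illustrative subset, not the loader's full mapping):

```python
# Sketch of the fused-QKV renaming in vLLM's llama load_weights:
# every checkpoint key containing q_proj/k_proj/v_proj is mapped
# onto the single fused "qkv_proj" parameter.
qkv_weight_names = ["q_proj", "k_proj", "v_proj"]

def fused_param_name(checkpoint_key: str) -> str:
    """Map a per-projection checkpoint key to the fused vLLM parameter name."""
    for weight_name in qkv_weight_names:
        if weight_name not in checkpoint_key:
            continue
        return checkpoint_key.replace(weight_name, "qkv_proj")
    return checkpoint_key

key = "model.layers.0.self_attn.q_proj.qweight"
print(fused_param_name(key))  # model.layers.0.self_attn.qkv_proj.qweight
```

The KeyError means the model's state dict had no parameter registered under that fused name, i.e. the stock loader at this point did not know about GPTQ's qweight tensors at all.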
Does anyone have a working example in a Google Colab notebook to follow?
Hi @ryanshrott, thanks for trying out vLLM. We now support AWQ. GPTQ support is in progress.
- Install vLLM from source by running:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
- Add quantization="awq" when initializing your AWQ model. For example,
model = LLM("casperhansen/vicuna-7b-v1.5-awq", quantization="awq")
This should just work.
Please note that we assume 1) the model is already quantized, 2) the model directory contains a config file (e.g., quantize_config.json) for the quantization parameters, and 3) AWQ is only supported on Ampere and newer GPUs.
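Point 2 can be sanity-checked before loading: make sure the quantization config file actually ships with the model and holds the expected parameters. A minimal sketch, assuming a local model directory and a typical AutoGPTQ-style field such as bits (exact field names vary by quantizer):

```python
import json
from pathlib import Path

def read_quantize_config(model_dir: str) -> dict:
    """Load the quantization parameters expected alongside the weights."""
    config_path = Path(model_dir) / "quantize_config.json"
    if not config_path.is_file():
        raise FileNotFoundError(
            f"{config_path} not found - the model directory must ship its "
            "quantization config so the engine can pick up the parameters."
        )
    return json.loads(config_path.read_text())

# Example with a hypothetical local checkout of the AWQ model:
# cfg = read_quantize_config("./description-awq-4bit")
# cfg.get("bits") would be 4 for a 4-bit quantized model
```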
Exciting. I will try it out tomorrow!
@WoosukKwon I tried to run the following:
pip install git+https://github.com/vllm-project/vllm.git
But I get this error in Google Colab:
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=2.0.0->vllm==0.1.7) (1.3.0)
Building wheels for collected packages: vllm
error: subprocess-exited-with-error
× Building wheel for vllm (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Building wheel for vllm (pyproject.toml) ... error
ERROR: Failed building wheel for vllm
Failed to build vllm
ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects
@WoosukKwon I got it working, but I'm finding the quantized models run extremely slowly, like >1000 times slower. Any idea?
I also get an out-of-memory error that I don't get with HF transformers.
vllm-gptq is an unofficial branch developed by me. Please run pip install git+https://github.com/chu-tianxiang/vllm-gptq@gptq_hf to install. I tested it with rshrott/description-together-ai-4bit with no problem.
Still, it's preferable to use AWQ for now, ahead of the official integration of GPTQ into vLLM.
Edit: this branch also requires optimum and the latest AutoGPTQ installed: pip install optimum git+https://github.com/PanQiWei/AutoGPTQ
@chu-tianxiang I'll give it a shot. The AWQ integration seems to make my model way slower, and I'm not sure why.
@chu-tianxiang I got it working, but there also seems to be a bit of a performance issue. The iterations per second are about 20 times slower than the non-quantized model, although I was using an RTX 3090 before and am now on an RTX 3080. I'm also running on a WSL machine here, but I'm still surprised by the performance loss. I also see this warning on WSL:
WARNING 09-18 10:49:45 cache_engine.py:96] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
Any idea why my performance is so much lower? I'm used to getting 2 sec/it. Now I get 15 secs/it!
After more benchmarking, I get 18 secs/iteration with GPTQ. I was previously getting 1.5 sec/iteration without GPTQ.
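For reference, the slowdown implied by those numbers is a factor of twelve; a trivial helper to express per-iteration timings as a slowdown ratio (the figures below are the ones reported in this thread):

```python
def slowdown(sec_per_it_quant: float, sec_per_it_base: float) -> float:
    """Ratio of quantized to baseline seconds-per-iteration."""
    return sec_per_it_quant / sec_per_it_base

print(slowdown(18.0, 1.5))  # GPTQ vs. non-quantized: 12.0x
print(slowdown(15.0, 2.0))  # earlier AWQ-on-WSL numbers: 7.5x
```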
Hi @ryanshrott, I'm trying to explore running vLLM with a GPTQ model, and this result looks exciting.