xusenlinzy/api-for-open-llm

Does this project support a qwen-7b model quantized with https://github.com/PanQiWei/AutoGPTQ ?

wangschang opened this issue · 5 comments

The following items must be checked before submission

  • Make sure you are using the latest code from the repository (git pull); some issues have already been addressed and fixed.
  • I have read the FAQ section of the project documentation and searched the existing issues/discussions; I did not find a similar problem or solution.

Type of problem

None

Operating system

None

Detailed description of the problem

# Paste the runtime code here (delete this code block if there is none)

Dependencies

# Please paste the dependencies here

Runtime logs or screenshots

# Please paste the run log here

Thanks, I'll give it a try.

I get an error when I run it.

Configuration:
```
PORT=8080

# model related
MODEL_NAME=qwen
MODEL_PATH=/root/model/qwen7bint4_1005
PROMPT_NAME=
EMBEDDING_NAME=
DEVICE_MAP=auto

# api related
API_PREFIX=/v1

# vllm related
USE_VLLM=true
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=1
DTYPE=half
```

Error message:

```
2023-10-07 10:20:13.784 | DEBUG | api.config:<module>:126 - Config: {'HOST': '0.0.0.0', 'PORT': 8080, 'MODEL_NAME': 'qwen', 'MODEL_PATH': '/root/model/qwen7bint4_1005', 'ADAPTER_MODEL_PATH': None, 'RESIZE_EMBEDDINGS': False, 'DEVICE': 'cuda', 'DEVICE_MAP': 'auto', 'GPUS': '', 'NUM_GPUs': 1, 'EMBEDDING_NAME': None, 'EMBEDDING_SIZE': None, 'EMBEDDING_DEVICE': 'cuda', 'QUANTIZE': 16, 'LOAD_IN_8BIT': False, 'LOAD_IN_4BIT': False, 'USING_PTUNING_V2': False, 'CONTEXT_LEN': None, 'STREAM_INTERVERL': 2, 'PROMPT_NAME': None, 'PATCH_TYPE': None, 'ALPHA': 'auto', 'API_PREFIX': '/v1', 'USE_VLLM': True, 'TRUST_REMOTE_CODE': True, 'TOKENIZE_MODE': 'slow', 'TENSOR_PARALLEL_SIZE': 1, 'DTYPE': 'half', 'GPU_MEMORY_UTILIZATION': 0.9, 'MAX_NUM_BATCHED_TOKENS': 5120, 'MAX_NUM_SEQS': 256, 'USE_STREAMER_V2': False, 'API_KEYS': None, 'ACTIVATE_INFERENCE': True}
INFO 10-07 10:20:18 llm_engine.py:70] Initializing an LLM engine with config: model='/root/model/qwen7bint4_1005', tokenizer='/root/model/qwen7bint4_1005', tokenizer_mode=slow, trust_remote_code=True, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
WARNING 10-07 10:20:19 tokenizer.py:63] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Traceback (most recent call last):
  File "server.py", line 2, in <module>
    from api.models import EMBEDDED_MODEL, GENERATE_MDDEL, app, VLLM_ENGINE
  File "/root/test/api-for-open-llm/api/models.py", line 138, in <module>
    VLLM_ENGINE = create_vllm_engine() if (config.USE_VLLM and config.ACTIVATE_INFERENCE) else None
  File "/root/test/api-for-open-llm/api/models.py", line 98, in create_vllm_engine
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 240, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/async_llm_engine.py", line 55, in __init__
    self.engine = engine_class(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 101, in __init__
    self._init_workers(distributed_init_method)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 133, in _init_workers
    self._run_workers(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 470, in _run_workers
    output = executor(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/worker/worker.py", line 67, in init_model
    self.model = get_model(self.model_config)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 57, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/model_executor/models/qwen.py", line 281, in load_weights
    loaded_weight = loaded_weight.view(3, total_num_heads,
RuntimeError: shape '[3, 32, 128, 4096]' is invalid for input of size 6291456
```
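For reference, the size in the last line is consistent with a GPTQ-packed checkpoint: the fp16 qkv weight vLLM expects has 3 × 32 × 128 × 4096 elements, while a GPTQ `qweight` packs eight 4-bit values into each int32 and therefore carries only one eighth as many entries. A minimal sketch of that arithmetic (the packing factor is an assumption about how the int4 checkpoint was produced):

```python
import torch

expected = 3 * 32 * 128 * 4096   # fp16 qkv weight elements vLLM expects: 50,331,648
packed = expected // 8           # GPTQ packs 8 int4 values per int32 -> 6,291,456 (the size in the error)

try:
    torch.empty(packed).view(3, 32, 128, 4096)   # the same view() vLLM's qwen.py attempts
except RuntimeError as e:
    print(e)   # shape '[3, 32, 128, 4096]' is invalid for input of size 6291456
```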

vLLM does not support GPTQ quantization: vllm-project/vllm#1056

vLLM does not support it.
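Since the vLLM path cannot load the GPTQ checkpoint, a possible workaround is to load the AutoGPTQ-quantized model directly with the auto-gptq library instead. A minimal sketch, assuming the checkpoint at `/root/model/qwen7bint4_1005` was produced by AutoGPTQ and that auto-gptq with Qwen support is installed; this is not the project's documented loading path:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = "/root/model/qwen7bint4_1005"   # the GPTQ-quantized qwen-7b checkpoint from this issue

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    device="cuda:0",
    trust_remote_code=True,   # Qwen ships custom modeling code
)

inputs = tokenizer("你好", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Alternatively, setting `USE_VLLM=false` in the `.env` makes the server skip the vLLM engine entirely; whether the non-vLLM loader then handles GPTQ weights depends on the project version.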