xusenlinzy/api-for-open-llm

Qwen-14B-Chat-Int4 fails to load

Jasonsey opened this issue · 1 comment

提交前必须检查以下项目 | The following items must be checked before submission

  • 请确保使用的是仓库最新代码(git pull),一些问题已被解决和修复。 | Make sure you are using the latest code from the repository (git pull), some issues have already been addressed and fixed.
  • 我已阅读项目文档FAQ章节并且已在Issue中对问题进行了搜索,没有找到相似问题和解决方案 | I have read the FAQ section of the project documentation and searched the existing issues / discussions without finding a similar problem or solution.

问题类型 | Type of problem

模型推理和部署 | Model inference and deployment

操作系统 | Operating system

Linux

详细描述问题 | Detailed description of the problem

Starting Qwen-14B-Chat-Int4 fails with the following error:

Traceback (most recent call last):
  File "/home/notebook/code/personal/IntentEngine/tmp/api-for-open-llm/server.py", line 2, in <module>
    from api.models import EMBEDDED_MODEL, GENERATE_MDDEL, app, VLLM_ENGINE
  File "/home/notebook/code/personal/IntentEngine/tmp/api-for-open-llm/api/models.py", line 135, in <module>
    GENERATE_MDDEL = create_generate_model() if (not config.USE_VLLM and config.ACTIVATE_INFERENCE) else None
  File "/home/notebook/code/personal/IntentEngine/tmp/api-for-open-llm/api/models.py", line 43, in create_generate_model
    model, tokenizer = load_model(
  File "/home/notebook/code/personal/IntentEngine/tmp/api-for-open-llm/api/apapter/model.py", line 235, in load_model
    model, tokenizer = adapter.load_model(
  File "/home/notebook/code/personal/IntentEngine/tmp/api-for-open-llm/api/apapter/model.py", line 107, in load_model
    model = self.model_class.from_pretrained(
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3257, in from_pretrained
    model = quantizer.post_init_model(model)
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 482, in post_init_model
    raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object

Preliminary analysis: the failure comes from the GPTQ-quantized modules; the Exllama backend requires every module to be on the GPU, so the loading code needs to handle this case. A possible workaround is sketched below.
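The error message itself points to two ways out: keep all modules on the GPU, or disable the Exllama kernels. The following is a minimal, untested sketch of both options using plain transformers, not the repository's own loading code in api/apapter/model.py; the model id and device index are assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "Qwen/Qwen-14B-Chat-Int4"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Option 1: place every module on a single GPU so the Exllama backend can be used.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},  # everything on cuda:0, no cpu/disk offload
    trust_remote_code=True,
).eval()

# Option 2: if CPU/disk offload cannot be avoided, disable Exllama via the
# quantization config, as the error message suggests.
quantization_config = GPTQConfig(bits=4, disable_exllama=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True,
).eval()

Either variant avoids the "Found modules on cpu/disk" ValueError; Option 1 keeps the faster Exllama kernels, while Option 2 trades speed for the ability to offload.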

Dependencies

# 请在此处粘贴依赖情况
# Please paste the dependencies here

运行日志或截图 | Runtime logs or screenshots

# 请在此处粘贴运行日志
# Please paste the run log here