xusenlinzy/api-for-open-llm

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (15248). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

guiniao opened this issue · 1 comment

The following items must be checked before submission

  • Make sure you are using the latest code from the repository (git pull); some issues have already been addressed and fixed.
  • I have read the project documentation and FAQ section, and I have searched the existing issues / discussions without finding a similar problem or solution.

Type of problem

Startup command

Operating system

Linux

Detailed description of the problem

The API is deployed with vllm, and the model is qwen1.5-14b-chat.
The .env configuration is as follows:

PORT=8000

# model related
MODEL_NAME=qwen
MODEL_PATH=./models/qwen-1.5-14b-chat
PROMPT_NAME=
EMBEDDING_NAME=

# device related
# GPU parallelization strategy
DEVICE_MAP=auto
# number of GPUs
NUM_GPUs=2

# api related
API_PREFIX=/v1

# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=1

# enable half precision to speed up inference and reduce GPU memory usage
DTYPE=half

# API_KEY: any string will do here
OPENAI_API_KEY=

When starting with python server.py, the following error is raised:

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (15248). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

I looked at some suggested fixes, e.g. python server.py --max-model-len 24320, but that did not work either.
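For reference, the two knobs named in the error message are vLLM engine arguments. Below is a minimal standalone sketch of how they are passed to vLLM itself (not this project's actual server.py wiring; the values are illustrative):

```python
from vllm import LLM

# Standalone sketch: the same ValueError comes from vLLM when max_model_len
# exceeds what the KV cache can hold under the current gpu_memory_utilization budget.
llm = LLM(
    model="./models/qwen-1.5-14b-chat",
    dtype="half",
    trust_remote_code=True,
    tensor_parallel_size=2,        # shard the weights across both GPUs
    gpu_memory_utilization=0.9,    # fraction of GPU memory vLLM may use (default 0.9)
    max_model_len=8192,            # cap the context length so the KV cache fits
)
```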

In addition, although I set NUM_GPUs=2, only one GPU appears to be used.

Dependencies

# Please paste the dependencies here

peft 0.10.0
sentence-transformers 2.6.1
torch 2.1.2
transformers 4.39.3
transformers-stream-generator 0.0.5

Runtime logs or screenshots

# Please paste the run log here
(error screenshot)

Update the project code, then change the following configuration:

TENSOR_PARALLEL_SIZE=2 # number of GPUs
MODEL_NAME=qwen2
PROMPT_NAME=qwen2

If there is not enough GPU memory to start with the 32k context, you can set CONTEXT_LEN=8192.
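Putting these suggestions together with the original .env, a 2-GPU configuration might look like the following (a sketch based only on values mentioned in this thread; CONTEXT_LEN is optional if the 32k context fits in memory):

```
MODEL_NAME=qwen2
MODEL_PATH=./models/qwen-1.5-14b-chat
PROMPT_NAME=qwen2
ENGINE=vllm
TRUST_REMOTE_CODE=true
DTYPE=half
TENSOR_PARALLEL_SIZE=2   # number of GPUs
CONTEXT_LEN=8192         # only if 32k does not fit in GPU memory
```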