ValueError: The model's max seq len (163840) is larger than the maximum number of tokens that can be stored in KV cache (13360). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
ArtificialZeng opened this issue · 3 comments
Traceback (most recent call last):
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/swift/cli/deploy.py", line 5, in
[rank0]: deploy_main()
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]: result = llm_x(args, **kwargs)
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/swift/llm/deploy.py", line 773, in llm_deploy
[rank0]: llm_engine, template = prepare_vllm_engine_template(args, use_async=True)
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/swift/llm/utils/vllm_utils.py", line 542, in prepare_vllm_engine_template
[rank0]: llm_engine = get_vllm_engine(
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/swift/llm/utils/vllm_utils.py", line 116, in get_vllm_engine
[rank0]: llm_engine = llm_engine_cls.from_engine_args(engine_args)
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 381, in init
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 263, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 375, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]: self._run_workers("initialize_cache",
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
[rank0]: raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]: File "/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/site-packages/vllm/worker/worker.py", line 374, in raise_if_cache_size_invalid
[rank0]: raise ValueError(
[rank0]: ValueError: The model's max seq len (163840) is larger than the maximum number of tokens that can be stored in KV cache (13360). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
ERROR 08-15 17:01:04 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3015877 died, exit code: -15
ERROR 08-15 17:01:04 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3015878 died, exit code: -15
ERROR 08-15 17:01:04 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3015880 died, exit code: -15
INFO 08-15 17:01:04 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 7 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/apus/mambaforge/envs/vllm_deepseekv2/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
I have the same question.
I have the same question.
Hi @ArtificialZeng @EdWangLoDaSc - this is a vLLM issue, not a swift one. You need to pass a --max-model-len value smaller than the KV cache capacity (the 13360 tokens reported in the title of this issue).
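For intuition, here is a rough, standalone sketch of the check behind this error (raise_if_cache_size_invalid in vllm/worker/worker.py): the KV cache can hold num_gpu_blocks * block_size tokens, and the engine refuses to start if max_model_len exceeds that. The numbers below come from the error message; the block size of 16 is vLLM's default and is assumed here, as is the block count.

block_size = 16                      # vLLM default; assumed
num_gpu_blocks = 835                 # assumed: 835 * 16 = 13360 tokens of KV cache
kv_cache_capacity = block_size * num_gpu_blocks

max_model_len = 163840               # DeepSeek-V2's default max seq len

# Mirrors the ValueError shown in the traceback above.
if max_model_len > kv_cache_capacity:
    raise ValueError(
        f"The model's max seq len ({max_model_len}) is larger than the maximum "
        f"number of tokens that can be stored in KV cache ({kv_cache_capacity}). "
        "Try increasing gpu_memory_utilization or decreasing max_model_len."
    )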
This worked for me -
python -m vllm.entrypoints.openai.api_server --trust-remote-code --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --port 9000 --host 0.0.0.0 --max-model-len 80000
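If you are starting the engine from Python rather than the CLI, the same two knobs are exposed on vLLM's LLM constructor. A minimal sketch, assuming the same model; the values are illustrative, not tuned:

from vllm import LLM

# Cap the context length so it fits in the KV cache, and/or give vLLM a larger
# share of GPU memory for cache blocks. Values here are examples only.
llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    trust_remote_code=True,
    max_model_len=80000,             # keep this at or below the reported KV cache capacity
    gpu_memory_utilization=0.95,     # default is 0.9; raising it leaves more memory for cache blocks
)

swift deploy should forward equivalent settings to vLLM (--max_model_len and --gpu_memory_utilization in ms-swift, if I remember the argument names correctly).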
Please see vllm-project/vllm#2418 for more details