sherdencooper/GPTFuzz

use error

zky001 opened this issue · 1 comment

The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (1792). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

Hi, thanks for running our code. It looks like you are encountering an issue with vLLM. You could refer to vllm-project/vllm#2418 and try the solution mentioned there. Since vLLM's behavior may depend on your CUDA and PyTorch versions, I cannot determine the exact fix for your case; see the sketches below for the two workarounds. If you still encounter issues with vLLM, you may fall back to Hugging Face inference instead.
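As a minimal sketch of the first workaround, the two knobs named in the error message (`gpu_memory_utilization` and `max_model_len`) can be passed directly to vLLM's `LLM` constructor. The model name below is a placeholder, and the exact values are assumptions you would tune for your GPU:

```python
from vllm import LLM

# Either reserve more GPU memory for the KV cache by raising
# gpu_memory_utilization, or cap max_model_len at or below the KV-cache
# capacity reported in the error (1792 tokens in this case).
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder; use your target model
    gpu_memory_utilization=0.95,            # assumed value; default is lower
    max_model_len=1792,                     # <= reported KV-cache capacity
)
```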
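If vLLM keeps failing, a rough sketch of the Hugging Face fallback looks like the following. Again, the model name is a placeholder, and `device_map="auto"` assumes the `accelerate` package is installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; use your target model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Generate a short completion without vLLM's KV-cache preallocation.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```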