imoneoi/openchat

Error running new model openchat-3.5-0106-gemma on RTX 4090 24 GB machine; works with older Mistral-based model


Error when running the new Gemma model on an RTX 4090 24 GB GPU. Below are the command (vLLM with Ray) and the resulting error. The same command works well with the older Mistral-based model openchat-3.5-0106.


python -m ochat.serving.openai_api_server --model ~/projects/llm_models/openchat/openchat-3.5-0106-gemma/ --engine-use-ray --worker-use-ray --max-model-len 8000 --tensor-parallel-size 1 --host 0.0.0.0 --disable-log-requests --disable-log-stats --log-file openchat.log

actor_id=a01a4c662fc834d5e0d0230f01000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7f8ec768d110>)
  File "/home/ubuntu/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
  File "/home/ubuntu/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/projects/openchat/.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/home/ubuntu/projects/openchat/.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 366, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (5888). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
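The 5888 figure is how many tokens of KV cache fit in whatever memory is left after the weights are loaded. A rough back-of-envelope sketch of why 8192 does not fit; the Gemma-7B layer/head values below are assumptions, not taken from the log:

# Rough KV-cache size estimate (assumed Gemma-7B config values, fp16 cache).
num_layers   = 28    # assumed
num_kv_heads = 16    # assumed (Gemma-7B uses multi-head attention)
head_dim     = 256   # assumed
bytes_fp16   = 2

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16  # K and V
print(kv_bytes_per_token / 1024)            # ~448 KiB per token
print(8192 * kv_bytes_per_token / 1024**3)  # ~3.5 GiB needed for 8192 tokens
print(5888 * kv_bytes_per_token / 1024**3)  # ~2.5 GiB is what actually fit

With these assumed values, an 8192-token cache needs roughly 3.5 GiB on top of the ~17 GiB of fp16 weights, which overflows the default memory budget on a 24 GB card.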

Gemma takes up much more VRAM because its embedding layer is very large. Can you try tuning --gpu-memory-utilization, e.g. setting it to 0.95 or higher?
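For reference, the same knob is exposed when driving vLLM directly from Python. A minimal sketch using the model path from the report above; 0.95 mirrors the suggested value, and 0.90 is vLLM's default:

import os
from vllm import LLM

# Raise the fraction of the 24 GB card that vLLM may claim for weights + KV cache.
llm = LLM(
    model=os.path.expanduser("~/projects/llm_models/openchat/openchat-3.5-0106-gemma/"),
    max_model_len=8192,
    gpu_memory_utilization=0.95,  # default is 0.90
)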

I am able to run it by increasing the GPU memory utilization with the following command:

python -m ochat.serving.openai_api_server --model ~/projects/llm_models/openchat/openchat-3.5-0106-gemma/ --engine-use-ray --worker-use-ray --max-model-len 8192 --tensor-parallel-size 1 --gpu-memory-utilization .95 --host 0.0.0.0 --api-keys sk-try-openchat --disable-log-requests --disable-log-stats --log-file openchat.log

It does not work on a 4090 with 24 GB of VRAM unless --gpu-memory-utilization is set to 0.95 or higher.
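If tuning the value further, a quick way to see how much of the 24 GB is actually free before launching the server (assuming PyTorch, which vLLM already depends on):

import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current CUDA device
print(f"{free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")

Anything else already holding VRAM (a desktop session, another process) shrinks the usable budget, which is why the same setting can fit on one machine and fail on another.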