Error running new model openchat-3.5-0106-gemma on RTX 4090 24GB machine; works with older Mistral-based model
Closed this issue · 2 comments
vikrantrathore commented
Error when running the new Gemma model on an RTX 4090 24 GB GPU. Below is the vllm command (with Ray) and the resulting error. The same command works fine with the older Mistral-based model openchat-3.5-0106.
python -m ochat.serving.openai_api_server --model ~/projects/llm_models/openchat/openchat-3.5-0106-gemma/ --engine-use-ray --worker-use-ray --max-model-len 8000 --tensor-parallel-size 1 --host 0.0.0.0 --disable-log-requests --disable-log-stats --log-file openchat.log
actor_id=a01a4c662fc834d5e0d0230f01000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7f8ec768d110>)
  File "/home/ubuntu/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
  File "/home/ubuntu/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/projects/openchat/.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 131, in __init__
    self._init_cache()
  File "/home/ubuntu/projects/openchat/.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 366, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (8192) is larger than the maximum number of tokens that can be stored in KV cache (5888). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine
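For context, vLLM sizes the KV cache from whatever GPU memory is left after the model weights are loaded, so the cache capacity in tokens depends on the model's layer count and attention shape. A back-of-envelope sketch of that arithmetic (the function and its parameters are illustrative, not vLLM's internal API):

```python
def kv_cache_token_capacity(free_bytes, num_layers, num_kv_heads,
                            head_dim, dtype_bytes=2):
    """Rough upper bound on how many tokens fit in the KV cache.

    Per token, each layer stores one key and one value vector:
    2 * num_kv_heads * head_dim * dtype_bytes bytes per layer.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_bytes // bytes_per_token

# Illustrative numbers only: 1 GiB free, 32 layers, 8 KV heads,
# head_dim 128, fp16 (2 bytes) -> 8192 tokens of cache.
print(kv_cache_token_capacity(1 << 30, 32, 8, 128, 2))
```

When the weights leave too little free memory, this capacity drops below `max_model_len` (8192 here vs. 5888 available), which is exactly the check that raises the ValueError above.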
imoneoi commented
Gemma takes up much more VRAM because its embedding table is very large. Can you try tuning `--gpu-memory-utilization`, e.g. setting it to 0.95 or higher?
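The embedding-size difference is easy to quantify. A rough comparison in fp16, using vocabulary and hidden sizes as reported in the public model configs (assumed here, not re-verified):

```python
# Back-of-envelope embedding table size in fp16 (2 bytes per parameter).
# Vocab/hidden sizes below are assumptions taken from the public configs.
gemma_vocab, gemma_hidden = 256_000, 3072      # Gemma-7B
mistral_vocab, mistral_hidden = 32_000, 4096   # Mistral-7B

gemma_emb_gb = gemma_vocab * gemma_hidden * 2 / 1e9
mistral_emb_gb = mistral_vocab * mistral_hidden * 2 / 1e9
print(f"Gemma embedding:   ~{gemma_emb_gb:.2f} GB")    # ~1.57 GB
print(f"Mistral embedding: ~{mistral_emb_gb:.2f} GB")  # ~0.26 GB
```

Roughly a gigabyte more spent on embeddings alone, which directly shrinks the memory left over for the KV cache on a 24 GB card.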
vikrantrathore commented
I am able to run it by increasing the GPU memory utilization with the following command:
python -m ochat.serving.openai_api_server --model ~/projects/llm_models/openchat/openchat-3.5-0106-gemma/ --engine-use-ray --worker-use-ray --max-model-len 8192 --tensor-parallel-size 1 --gpu-memory-utilization .95 --host 0.0.0.0 --api-keys sk-try-openchat --disable-log-requests --disable-log-stats --log-file openchat.log
It does not work on an RTX 4090 with 24 GB VRAM unless `--gpu-memory-utilization` is set to 0.95 or higher.
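Why the 0.95 setting matters can be sketched with a simple budget: the utilization fraction caps how much of the card vLLM may use, and the KV cache gets whatever remains after the weights. The weight figure below is an assumed approximation for fp16 Gemma-7B, not a measured value:

```python
# Illustrative memory budget on a 24 GB card.
total_gb = 24.0
weights_gb = 17.0  # assumed approximate fp16 Gemma-7B footprint

for util in (0.90, 0.95):
    kv_budget_gb = total_gb * util - weights_gb
    print(f"utilization={util}: ~{kv_budget_gb:.1f} GB left for KV cache")
```

Bumping utilization from 0.90 to 0.95 frees about 1.2 GB more for the cache, which on this card is the difference between falling short of and fitting the 8192-token context.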