Unable to specify GPU usage in VLLM code
humza-sami opened this issue · 15 comments
I am having difficulty specifying GPU usage for different models in an LLM inference pipeline using vLLM. Specifically, I have 4 RTX 4090 GPUs available, and I aim to run an LLM with a size of 42GB on 2 RTX 4090 GPUs (~48GB) and a separate model with a size of 22GB on 1 RTX 4090 GPU (~24GB).
This is my code for running 42GB model on two GPUs.
from vllm import LLM
llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2)
output = llm.generate(text)
However, I haven't found a straightforward method within the vLLM library to specify which GPU should be used for each model.
You can specify the devices by using the CUDA_VISIBLE_DEVICES environment variable.
from vllm import LLM
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
llm_1 = LLM(llm_1_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=2)
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
llm_2 = LLM(llm_2_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=1)
This still loads the 2nd LLM on GPUs 1 and 2 and gives a memory error.
Try instantiating them in different scripts?
@simon-mo Separately they work, but my goal is to run two different LLMs: one LLM on 2 GPUs and the second LLM on a 3rd GPU.
from vllm import LLM
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
llm_1 = LLM(llm_1_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=2)
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
llm_2 = LLM(llm_2_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=1)
RuntimeError Traceback (most recent call last)
Cell In[11], line 3
1 os.environ["CUDA_VISIBLE_DEVICES"] = "2"
----> 3 llm_2 = LLM("codellama/CodeLlama-7b-Instruct-hf",max_model_len=4000,gpu_memory_utilization=0.9, tensor_parallel_size=1)

File /usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py:109, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, disable_custom_all_reduce, **kwargs)
90 kwargs["disable_log_stats"] = True
91 engine_args = EngineArgs(
92 model=model,
93 tokenizer=tokenizer,
(...)
107 **kwargs,
108 )
--> 109 self.llm_engine = LLMEngine.from_engine_args(engine_args)
110 self.request_counter = Counter()

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:371, in LLMEngine.from_engine_args(cls, engine_args)
369 placement_group = initialize_cluster(parallel_config)
370 # Create the LLM engine.
--> 371 engine = cls(*engine_configs,
372 placement_group,
373 log_stats=not engine_args.disable_log_stats)
374 return engine

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:120, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, lora_config, placement_group, log_stats)
118 self._init_workers_ray(placement_group)
119 else:
--> 120 self._init_workers()
122 # Profile the memory usage and initialize the cache.
123 self._init_cache()

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:163, in LLMEngine._init_workers(self)
149 distributed_init_method = get_distributed_init_method(
150 get_ip(), get_open_port())
151 self.driver_worker = Worker(
152 self.model_config,
153 self.parallel_config,
(...)
161 is_driver_worker=True,
162 )
--> 163 self._run_workers("init_model")
164 self._run_workers("load_model")

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:1014, in LLMEngine._run_workers(self, method, driver_args, driver_kwargs, max_concurrent_workers, use_ray_compiled_dag, *args, **kwargs)
1011 driver_kwargs = kwargs
1013 # Start the driver worker after all the ray workers.
-> 1014 driver_worker_output = getattr(self.driver_worker,
1015 method)(*driver_args, **driver_kwargs)
1017 # Get the results of the ray workers.
1018 if self.workers:

File /usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py:94, in Worker.init_model(self, cupy_port)
91 raise RuntimeError(
92 f"Not support device type: {self.device_config.device}")
93 # Initialize the distributed environment.
---> 94 init_distributed_environment(self.parallel_config, self.rank,
95 cupy_port, self.distributed_init_method)
96 # Initialize the model.
97 set_random_seed(self.model_config.seed)

File /usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py:247, in init_distributed_environment(parallel_config, rank, cupy_port, distributed_init_method)
245 torch_world_size = torch.distributed.get_world_size()
246 if torch_world_size != parallel_config.world_size:
--> 247 raise RuntimeError(
248 "torch.distributed is already initialized but the torch world "
249 "size does not match parallel_config.world_size "
250 f"({torch_world_size} vs. {parallel_config.world_size}).")
251 elif not distributed_init_method:
252 raise ValueError(
253 "distributed_init_method must be set if torch.distributed "
254 "is not already initialized")

RuntimeError: torch.distributed is already initialized but the torch world size does not match parallel_config.world_size (2 vs. 1).
I've had your exact same scenario. My solution was to run on docker-compose, because there you can specify which GPU ids to make available to each instance.
Then expose their APIs and consume them with another script. It would be faster if you run the OpenAI-compatible API; however, if you want to add something custom like lmformatenforcer, you might need to make the implementation yourself.
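A minimal sketch of the "consume with another script" step, assuming each instance runs vLLM's OpenAI-compatible server and serves the `/v1/completions` route. The host, ports, and model names below are placeholders, not values from anyone's setup in this thread.

```python
import json
import urllib.request

# Hypothetical endpoints: one OpenAI-compatible vLLM server per model.
ENDPOINTS = {
    "big": ("http://localhost:8000", "llm_1_name"),
    "small": ("http://localhost:8001", "llm_2_name"),
}


def build_completion_request(which, prompt, max_tokens=64):
    """Build (but do not send) a POST request for the chosen server."""
    base_url, model = ENDPOINTS[which]
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(which, prompt, max_tokens=64):
    """Send the request and return the text of the first completion choice."""
    req = build_completion_request(which, prompt, max_tokens)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With both servers running, `complete("big", "Hello")` and `complete("small", "Hello")` would hit the two models independently, so the calling script never needs to touch CUDA itself.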
@KatIsCoding Thanks for your suggestion. Yeah, I ended up with the same thought, that I have to implement the Ray clustering myself. What I have noticed is that when I initialize a 2nd LLM object, it recreates a cluster of GPUs/CPUs. If I manually change CUDA_VISIBLE_DEVICES before making the 2nd LLM object in the same Python script, Ray gets confused and throws an error because the current configuration clashes with the 1st LLM object's cluster.
In a single process (script), you cannot create a 2nd LLM object on different GPUs by changing CUDA_VISIBLE_DEVICES.
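One way around this limitation is to give each model its own process and set CUDA_VISIBLE_DEVICES in that process's environment before it starts. Below is a sketch, assuming vLLM's OpenAI-compatible server entrypoint (`python -m vllm.entrypoints.openai.api_server`); the GPU ids, ports, and model names are placeholders, and the helper names are made up for illustration.

```python
import os
import subprocess
import sys


def server_command(model, tp_size, port):
    """Command line for one vLLM OpenAI-compatible server."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tp_size),
        "--port", str(port),
    ]


def env_for_gpus(gpu_ids):
    """Copy of the current environment restricted to the given GPUs."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


def launch(model, gpu_ids, port):
    """Start one server in its own process; TP size equals the GPU count."""
    return subprocess.Popen(
        server_command(model, len(gpu_ids), port),
        env=env_for_gpus(gpu_ids),
    )


def main():
    # Placeholder model names; replace with real ones.
    procs = [
        launch("llm_1_name", [0, 1], 8000),  # larger model on two GPUs
        launch("llm_2_name", [2], 8001),     # smaller model on one GPU
    ]
    for p in procs:
        p.wait()
```

Calling `main()` starts both servers. Because each child process gets its own environment, neither one inherits the other's torch.distributed or Ray state.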
@KatIsCoding can you share your Docker setup? I don't have much experience with Docker. Thanks
@humza-sami were you able to figure out how to do this? I am facing the same problem and have no idea how to fix it atm. There is a solution using RAY but not sure how to implement that
Do you have any news on this issue?
I'm sorry for my late response on the topic. As @sAviOr287 mentioned, there is a Ray implementation out there; however, I could not find much information about it.
So far my approach to the problem was just using docker and different instances for different models, like so:
version: "3.8"
networks:
  load_balancing:
    name: load_balancing
services:
  sqlcoder:
    profiles: [ai]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: sqlcoder
    networks:
      - load_balancing
    environment:
      - MODEL_ID=defog/sqlcoder-7b-2
      - TP_SIZE=1
      - ACCEPT_EMPTY_IDS=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
  llama:
    profiles: [ai-exp]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: llama
    networks:
      - load_balancing
    environment:
      - AI_SERVICE_PORT=1337
      - MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
      - ACCEPT_EMPTY_IDS=1
      - TP_SIZE=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  nginx:
    image: nginx:1.15-alpine
    profiles: [ai]
    networks:
      - load_balancing
    depends_on:
      - sqlcoder
      - llama
    volumes:
      - ./nginx-conf:/etc/nginx/conf.d
    ports:
      - 6565:6565 # SQL Coder
      - 6566:6566 # Llama
It is a load-balancing approach; however, a different model gets hit depending on which port you are using.
My Dockerfile pretty much just installs vLLM plus some other stuff; however, it could be completely replaced with something like the OpenAI-compatible server that vLLM has.
The most important thing about the configuration is the usage of
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0"]
          capabilities: [gpu]
By specifying device_ids, you are essentially telling Docker which GPUs to make available to each container.
Has anyone found a solution? I am trying to use it with accelerate but am getting the same error.
I found that specifying GPU ids for the Ray executor can be achieved by modifying worker_node_and_gpu_ids in vllm/executor/ray_gpu_executor.py
Thanks for the suggestion. Do you have any example code for this? I don't think I fully understand your solution.
Hi @sAviOr287, I added the following code in vllm/executor/ray_gpu_executor.py (the GPU ids that I want to use are given in self.GPUs):
# update GPU IDs if specified.
if self.GPUs is not None:
    assert (len(self.GPUs) == len(worker_node_and_gpu_ids)), "Number of GPUs specified does not match the number of workers."
    for i, (node_id, gpu_ids) in enumerate(worker_node_and_gpu_ids):
        worker_node_and_gpu_ids[i] = (node_id, [self.GPUs[i]])
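For readers who want to see the remapping in isolation, here is a standalone sketch of the same logic. `requested_gpus` stands in for `self.GPUs`, which is a custom attribute added by this patch and not a stock vLLM attribute; the function name is made up, and the input is assumed to be a list of `(node_id, gpu_ids)` tuples, as in the loop above.

```python
def remap_gpu_ids(worker_node_and_gpu_ids, requested_gpus=None):
    """Return a new list in which worker i is pinned to requested_gpus[i].

    If requested_gpus is None, the original assignment is returned unchanged.
    """
    if requested_gpus is None:
        return list(worker_node_and_gpu_ids)
    if len(requested_gpus) != len(worker_node_and_gpu_ids):
        raise ValueError(
            "Number of GPUs specified does not match the number of workers."
        )
    # Keep each worker's node id, but replace its GPU list with the
    # single requested GPU for that worker.
    return [
        (node_id, [gpu])
        for (node_id, _old_gpu_ids), gpu in zip(
            worker_node_and_gpu_ids, requested_gpus
        )
    ]
```

For example, two workers on one node that were assigned GPUs 0 and 1 can be moved to GPUs 2 and 3 with `remap_gpu_ids([("node-a", [0]), ("node-a", [1])], [2, 3])`. Unlike the in-place patch, this version returns a new list and raises `ValueError` instead of using `assert`, so the check survives `python -O`.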