How to use PCIe peer-to-peer or NVLink between two containers that each have an isolated GPU
I am a new user of UCX. I have a situation where two containers each use a different GPU. On the host, the two GPUs can communicate via PCIe P2P or NVLink, but from inside the containers they cannot.
I am looking for a way to solve this problem.
See the NVLink and Docker/Kubernetes section of the ucx-py readthedocs documentation: In order to use NVLink when running in containers using Docker and/or Kubernetes the processes must share an IPC namespace for NVLink to work correctly.
Can UCX solve this problem? If so, how?
Your assistance in this matter will be greatly appreciated.
Please try sharing the PID namespace between the containers. For example, add the following option to the command that starts the first container:
--name docker_1
and add the following option to the command line for the second container:
--pid=container:docker_1
The containers will then share a PID namespace.
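To confirm that the two containers really ended up in the same namespaces, one option is to compare the namespace identifiers that Linux exposes under /proc/self/ns. The following is a minimal sketch, not part of UCX or Docker; the helper name namespace_ids is hypothetical. Run it in each container and compare the output: matching ipc and pid values indicate that the namespaces are shared.

```python
import os


def namespace_ids(names=("ipc", "pid"), proc_root="/proc/self/ns"):
    """Return {namespace: identifier} for the current process.

    Each entry under /proc/self/ns is a symlink whose target looks like
    'ipc:[4026531839]'. Two processes share a namespace when the targets match.
    Entries that cannot be read (non-Linux host, restricted /proc) map to None.
    """
    ids = {}
    for name in names:
        try:
            ids[name] = os.readlink(os.path.join(proc_root, name))
        except OSError:
            ids[name] = None
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name}: {ident}")
```

If the identifiers differ between the containers, the namespace options were probably not applied as intended, and that is worth fixing before looking at the communication library itself.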
@rakhmets Thank you for your reply and suggestions.
I tried your method as follows.
The first container:
docker run --name master -it --rm --gpus device=0 --network bridge --ipc host -v $(pwd):/data --entrypoint /bin/bash nvcr.io/nvidia/pytorch:24.01-py3
The second container:
docker run -it --rm --gpus device=1 --network bridge --ipc host --pid 'container:master' -v $(pwd):/data --entrypoint /bin/bash nvcr.io/nvidia/pytorch:24.01-py3
The two containers each use a different GPU. The following is the topology shown by nvidia-smi topo -m:
      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  NIC3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV12  SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0              N/A
GPU1  NV12  X     SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0              N/A
GPU2  SYS   SYS   X     NV12  SYS   SYS   PIX   PIX   16-31,48-63   1              N/A
GPU3  SYS   SYS   NV12  X     SYS   SYS   PIX   PIX   16-31,48-63   1              N/A
NIC0  PIX   PIX   SYS   SYS   X     PIX   SYS   SYS
NIC1  PIX   PIX   SYS   SYS   PIX   X     SYS   SYS
NIC2  SYS   SYS   PIX   PIX   SYS   SYS   X     PIX
NIC3  SYS   SYS   PIX   PIX   SYS   SYS   PIX   X
I then ran the following commands in the containers:
The first container:
torchrun --nnodes 2 --nproc_per_node 1 --node_rank 0 --master_addr 172.17.0.2 --master_port 29400 multinode.py
The second container:
torchrun --nnodes 2 --nproc_per_node 1 --node_rank 1 --master_addr 172.17.0.2 --master_port 29400 multinode.py
The first container then reported an error, with the following output:
[1724241372.462397] [2a292d2c18cc:984 :0] tl_cuda_cache.c:231 UCC ERROR ipc-cache: failed to open ipc mem handle. addr:0x7fe456000000 len:16777216 err:1
Traceback (most recent call last):
File "/data/multinode.py", line 141, in <module>
main(args.save_every, args.total_epochs, args.batch_size)
File "/data/multinode.py", line 128, in main
trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
File "/data/multinode.py", line 65, in __init__
self.model = DDP(self.model, device_ids=[self.local_rank])
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 783, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 264, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-08-21 11:56:17,385] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 984) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
.
.
.
Root Cause (first observed failure):
[0]:
time : 2024-08-21_11:56:17
host : 2a292d2c18cc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 984)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The second container hangs with no output.
My understanding is that UCC is a communication library built on top of UCX; please correct me if I am wrong. I then looked at the code location where the UCC error is raised, and it uses the CUDA IPC interface.
Does this interface require that the two GPUs not be split across containers?
So I tried mounting both GPUs into both containers using the --gpus parameter, so that both containers see the same two GPUs.
This time it appeared to work: both containers produced output. However, nvidia-smi showed that GPU0 was used by both containers, while GPU1 was not used at all.
So I would like to ask whether this error is caused by UCC. If so, could you please give an example that uses UCX?
I look forward to your reply and suggestions.
Yes, UCC is a communication library that provides interfaces for collective operations. UCC uses UCX as one of the possible transports for point-to-point communications.
I guess the two processes in different containers are using the same device because each process takes the first device available on the system. For example, you can set CUDA_VISIBLE_DEVICES=0 in one container and CUDA_VISIBLE_DEVICES=1 in the other to force the use of different devices.
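As a rough sketch of this suggestion, the variable can also be set from inside the training script, provided that this happens before anything initializes CUDA. The helper below is hypothetical and not part of UCX, UCC, or PyTorch. It assumes one process per container and treats the RANK variable exported by torchrun as the physical GPU index, which matches the two-container launch in this thread but is only an assumption for other layouts.

```python
import os


def pin_visible_device(env=None):
    """Restrict this process to one GPU by setting CUDA_VISIBLE_DEVICES.

    Assumption: one process per container, and the global rank exported by
    torchrun (RANK) doubles as the physical GPU index. An existing
    CUDA_VISIBLE_DEVICES value is left untouched.
    """
    env = os.environ if env is None else env
    if "CUDA_VISIBLE_DEVICES" not in env:
        env["CUDA_VISIBLE_DEVICES"] = env.get("RANK", "0")
    return env["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    # Call this before anything initializes CUDA.
    print("CUDA_VISIBLE_DEVICES =", pin_visible_device())
```

Once only one device is visible, the process addresses it as device 0, which lines up with the local rank of 0 that each container gets when launched with --nproc_per_node 1. Passing the variable on the docker run command line with -e CUDA_VISIBLE_DEVICES=... should have the same effect without touching the script.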