Rank assignment issue with four containers across two different servers.
thsmfe001 opened this issue · 8 comments
I ran into a rank assignment issue while running the NCCL tests across four containers. My test environment is two servers with two GPUs each.
When I issued the command below, the ranks were not assigned the way I intended.
My intention was for the ranks to be assigned 0 through 7 in a distributed manner, but I got the output below instead.
Could you suggest a solution to this problem?
root@c5e62fb2396d:/workspace# cat rankfile
rank 0=10.10.10.2 slot=0
rank 1=10.10.11.2 slot=0
rank 2=10.10.20.2 slot=0
rank 3=10.10.21.2 slot=0
rank 4=10.10.10.2 slot=1
rank 5=10.10.11.2 slot=1
rank 6=10.10.20.2 slot=1
rank 7=10.10.21.2 slot=1
root@c5e62fb2396d:/workspace# mpirun -np 4 -allow-run-as-root -host 10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2 -rf rankfile /workspace/software/nccl-tests-master/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
WARNING: Open MPI tried to bind a process but failed. This is a
warning only; your job will continue, though performance may
be degraded.
Local host: c5e62fb2396d
Application name: /workspace/software/nccl-tests-master/build/all_reduce_perf
Error message: failed to bind memory
Location: rtc_hwloc.c:447
nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 3077 on a1eb0914a5e2 device 0 [0x61] NVIDIA L40
Rank 1 Group 0 Pid 3077 on a1eb0914a5e2 device 1 [0xe1] NVIDIA L40
Rank 0 Group 0 Pid 4482 on cb0142391811 device 0 [0x61] NVIDIA L40
Rank 1 Group 0 Pid 4482 on cb0142391811 device 1 [0xe1] NVIDIA L40
Rank 0 Group 0 Pid 1176 on c5e62fb2396d device 0 [0xca] NVIDIA L40
Rank 1 Group 0 Pid 1176 on c5e62fb2396d device 1 [0xe1] NVIDIA L40
Rank 0 Group 0 Pid 3860 on 877a4a03d442 device 0 [0xca] NVIDIA L40
Rank 1 Group 0 Pid 3860 on 877a4a03d442 device 1 [0xe1] NVIDIA L40
It doesn't look like the nccl-tests were compiled with MPI=1
I just followed the instructions on the README page: I downloaded the sources and ran make.
Do you mean I need to recompile with the command below?
make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
Yes, it looks like you're trying to run binaries that are not MPI-enabled, so you just end up with four independent processes, each with 2 GPUs.
Thank you so much. I will try it and then post the result.
I succeeded in recompiling with the MPI options, but then I got the error messages below. Based on my investigation of the recompiled binary, -np 1 works properly with any host, but running two or more processes (-np 2 and above) leads to an error. I think it is caused by MPI communication. Could you check the attached error logs?
root@c5e62fb2396d:/workspace# mpirun -np 4 -allow-run-as-root -host 10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2 /workspace/software/nccl-tests-master/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
[1716868653.124073] [c5e62fb2396d:1618 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.124093] [c5e62fb2396d:1618 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.124098] [c5e62fb2396d:1618 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 1; team_id 0; errmsg No pending message
[c5e62fb2396d:1618 :0:1618] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[1716868653.124031] [877a4a03d442:1142 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.124050] [877a4a03d442:1142 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.124056] [877a4a03d442:1142 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 0; team_id 0; errmsg No pending message
[1716868653.119230] [cb0142391811:1139 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.119252] [cb0142391811:1139 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.119258] [cb0142391811:1139 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 3; team_id 0; errmsg No pending message
[877a4a03d442:1142 :0:1142] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[cb0142391811:1139 :0:1139] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[1716868653.119305] [a1eb0914a5e2:1167 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.119324] [a1eb0914a5e2:1167 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.119329] [a1eb0914a5e2:1167 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 2; team_id 0; errmsg No pending message
[a1eb0914a5e2:1167 :0:1167] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
It looks like you are having issues running MPI jobs. Perhaps get a simple "hello world" MPI program working first before attempting to run the NCCL tests.
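A minimal MPI "hello world" along these lines can serve as that sanity check (a sketch, assuming an MPI implementation such as Open MPI with mpicc available inside each container; the file name is just for illustration):

```c
/* mpi_hello.c - minimal MPI sanity check.
 * Sketch only: assumes an MPI implementation (e.g. Open MPI) is
 * installed in every container, which the NCCL tests also require. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);    /* hostname of this rank */

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc mpi_hello.c -o mpi_hello and launch it with the same mpirun -np 4 -host ... line used for the NCCL test; if each container prints a distinct rank from 0 to 3, the MPI layer itself is working.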
But with UCX-based MPI I often find that exporting UCX_TLS=tcp
helps with most such issues. You may also need to select the correct UCX network device with UCX_NET_DEVICES
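For example (a config sketch; the eth0 interface name is an assumption to adapt to your fabric, and the command line mirrors the one from this thread):

```shell
# Restrict UCX to TCP transports (works around RDMA/endpoint-timeout issues)
export UCX_TLS=tcp
# Optionally pin UCX to the interface carrying the 10.10.x.x addresses
# (eth0 is an assumption; check with `ip addr` inside the container)
export UCX_NET_DEVICES=eth0

# Or forward the variables to all remote ranks via Open MPI's -x flag:
mpirun -np 4 -allow-run-as-root \
    -x UCX_TLS=tcp -x UCX_NET_DEVICES=eth0 \
    -host 10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2 \
    /workspace/software/nccl-tests-master/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```

Exporting the variables only in the launch shell affects just the local rank; the -x flag makes mpirun propagate them to every remote process as well.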
Thank you for your quick feedback. I recompiled with all of the make options from the README page:
"make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu"
I will test again with the new binary, and if I face the same issue I'll adopt your recommendation.
I'll update you with the result. Thank you.
Thank you for your support. After recompiling and applying UCX_TLS=tcp, the test ran successfully.
I really appreciate the quick support.