NCCL WARN Cannot use cuda/gdr transports as part of specified UCX_TLS
liuxingbo12138 opened this issue · 5 comments
When I run nccl-tests with SHARP, I hit the error in the title. What causes it?
I tested with the NGC 24.05 image, so it shouldn't be an environment issue. If I remove the cuda_copy parameter, leaving NCCL_UCX_TLS=rc_x, the test runs normally and SHARP job creation shows up in UFM, but the speed is the same as without SHARP.
mpirun -mca plm_rsh_args "-p 12138" --allow-run-as-root --bind-to socket -x LD_LIBRARY_PATH -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n -x NCCL_COLLNET_ENABLE=1 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG_SUBSYS=NET -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 -x NCCL_SOCKET_IFNAME=eth5 -x NCCL_COLLNET_ENABLE=1 --host 10.101.42.2:8,10.101.42.3:8,10.101.42.4:8,10.101.42.5:8 ./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1
Do you need to use -x NCCL_PLUGIN_P2P=ucx?
I don't believe that is the default for the SHARP plugin.
You don't need UCX in order to use SHARP in the external plugin.
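As a rough sketch of what that could look like, here is the command from the report with the UCX-specific settings (NCCL_PLUGIN_P2P, NCCL_UCX_RNDV_THRESH, UCX_MEMTYPE_CACHE) dropped and everything else left as posted. The hosts, port, HCA list, and interface name are copied from the original command. Whether SHARP is actually used still depends on how the fabric is configured, so treat this as a starting point rather than a verified fix. The script only prints the command for review.

```shell
#!/usr/bin/env bash
# Sketch: the reporter's mpirun command without the UCX P2P plugin settings.
# All values below are copied from the original report, not verified here.
set -euo pipefail

# Environment forwarded to every rank; the UCX-related variables are omitted.
env_flags=(
  -x LD_LIBRARY_PATH
  -x NCCL_COLLNET_ENABLE=1
  -x NCCL_DEBUG=INFO
  -x NCCL_DEBUG_SUBSYS=NET
  -x NCCL_IB_HCA=mlx5_10,mlx5_11,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17
  -x NCCL_SOCKET_IFNAME=eth5
)

cmd=(
  mpirun -mca plm_rsh_args "-p 12138" --allow-run-as-root --bind-to socket
  "${env_flags[@]}"
  --host 10.101.42.2:8,10.101.42.3:8,10.101.42.4:8,10.101.42.5:8
  ./build/all_reduce_perf -b 4G -e 4G -f 0 -i 0 -g 1
)

# Dry run: print the shell-quoted command. To launch it, run "${cmd[@]}" instead.
printf '%q ' "${cmd[@]}"
echo
```

Running the same command once more with NCCL_COLLNET_ENABLE=0 would give a baseline to compare the reported bandwidth against.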
When I run a Megatron LLM job without specifying the NCCL_PLUGIN_P2P=ucx and NCCL_UCX_TLS=rc_x,cuda_copy parameters, it reports the following error
I would also like to know how much difference SHARP makes in large model training compared with not using it.
Is SHARP enabled and configured on your system? I think you need to contact the system vendor or an Nvidia support representative/SA in order to be able to diagnose this complex issue with using SHARP and LLM training.
OK, thanks.