What is UCX's policy of choosing transport?
rzambre opened this issue · 2 comments
Describe the bug
In the past (with UCX 1.5.0), I used to set UCX_NET_DEVICES=mlx5_0:1 and UCX_TLS=rc_mlx5,rc and hope that rc_mlx5 would be used during the fast-path operations. If I set UCX_TLS=rc_mlx5 only, I would get an error during ucp_init.
With the latest UCX master, ucx_info -d shows that there are rc_verbs and rc_mlx5 transports. But when I set UCX_TLS=rc_mlx5,rc_verbs, I get an error during initialization. After playing around, I discovered that setting UCX_TLS=rc_mlx5,rc (as I had done earlier) works even though rc is not listed in ucx_info -d.
(1) What is the difference between setting UCX_TLS=rc_mlx5,rc_verbs and UCX_TLS=rc_mlx5,rc?
Using only the transports listed in ucx_info -d, what does work is UCX_TLS=rc_mlx5,ud_[mlx5|verbs].
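For anyone comparing their own setup, here is a small sketch for pulling just the transport names out of ucx_info -d. It assumes the output contains lines of the form "#      Transport: <name>"; the helper name list_transports is made up for illustration and is not part of UCX.

```shell
# Hypothetical helper (not part of UCX): reads `ucx_info -d` output on stdin
# and prints each distinct transport name once.
# Assumption: transports appear on lines like "#      Transport: rc_verbs".
list_transports() {
    sed -n 's/^#[[:space:]]*Transport:[[:space:]]*//p' | sort -u
}

# Only query UCX if ucx_info is actually on PATH.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | list_transports
fi
```

Names printed here are what the devices expose; as noted in this issue, an alias such as rc does not show up in this list.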
(2) More generally, is there an overview of how UCX chooses which transport to use for its critical-path operations such as ucp_tag_send_nb?
Steps to Reproduce
- Command line:
mpiexec -n 2 -ppn 1 -hosts <node1>,<node2> -env UCX_NET_DEVICES mlx5_0:1 -env UCX_TLS=rc_mlx5,rc_verbs ./osu_mbw_mr
- UCX version used: master @eaad8e2
  + UCX configure flags: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt
- MPICH/CH4/UCX @ d1e673a
Setup and versions
- [rzambre@hpc3-14-12:~] $ cat /etc/system-release
  CentOS Linux release 7.7.1908 (Core)
- [rzambre@hpc3-14-12:~] $ uname -a
  Linux hpc3-14-12 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
@rzambre pls see https://openucx.readthedocs.io/en/master/faq.html#selecting-networks-and-transports
rc_verbs and rc_mlx5 should not be used directly; use the transport names and aliases listed in https://openucx.readthedocs.io/en/master/faq.html#list-of-main-transports-and-aliases instead.
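As a rough sketch of what that could look like for the setup in this issue: the alias names below (rc_x, rc_v, rc) are taken from my reading of the FAQ list linked above and should be checked against it for the UCX version in use. As I understand the FAQ, these RC aliases also bring in a UD transport for connection bootstrap, which would be consistent with UCX_TLS=rc_mlx5,ud_[mlx5|verbs] working in the report above.

```shell
# Assumption: alias names as listed in the UCX FAQ; verify for your version.
#   rc_x : RC with accelerated verbs
#   rc_v : RC with plain verbs
#   rc   : RC, using the accelerated transport when possible
export UCX_NET_DEVICES=mlx5_0:1   # device and port from the original report
export UCX_TLS=rc_x               # an alias instead of rc_mlx5 or rc_verbs

# Show the UCX settings that child processes will inherit.
env | grep '^UCX_'

# The benchmark from the report would then be launched as before, with the
# environment forwarded by the launcher, e.g.:
#   mpiexec -n 2 -ppn 1 -hosts <node1>,<node2> ./osu_mbw_mr
```

Whether the launcher forwards exported variables to remote ranks depends on the MPI launcher configuration; passing them explicitly with -env, as in the original command line, avoids relying on that.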
Thanks! Wasn't aware of the new documentation. That is helpful.