openucx/ucx

What is UCX's policy of choosing transport?

rzambre opened this issue · 2 comments

Describe the bug

In the past (with UCX 1.5.0), I used to set UCX_NET_DEVICES=mlx5_0:1 and UCX_TLS=rc_mlx5,rc in the hope that rc_mlx5 would be used for the fast-path operations. If I set only UCX_TLS=rc_mlx5, I got an error during ucp_init.

With the latest UCX master, ucx_info -d lists both rc_verbs and rc_mlx5 as transports. However, setting UCX_TLS=rc_mlx5,rc_verbs gives an error during initialization. After some experimenting, I found that UCX_TLS=rc_mlx5,rc (my earlier setting) still works, even though rc is not listed in ucx_info -d.

(1) What is the difference between setting UCX_TLS=rc_mlx5,rc_verbs and UCX_TLS=rc_mlx5,rc?

Using only the transports listed in ucx_info -d, what does work is UCX_TLS=rc_mlx5,ud_mlx5 or UCX_TLS=rc_mlx5,ud_verbs.
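For reference, the transport names can be pulled out of ucx_info -d with a small filter like the one below. This is only a sketch: it assumes the output contains lines of the form "#      Transport: <name>", and list_transports is a hypothetical helper name, not part of UCX.

```shell
#!/bin/sh
# Sketch: extract the unique transport names from `ucx_info -d` output.
# Assumption: transports appear on lines like "#      Transport: rc_mlx5".
# list_transports is a hypothetical helper, not a UCX tool.
list_transports() {
    # Read ucx_info -d output on stdin, keep only the name after "Transport:".
    sed -n 's/^#[[:space:]]*Transport:[[:space:]]*//p' | sort -u
}

# Run against the local installation only if ucx_info is available.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | list_transports
fi
```

A name such as rc that works in UCX_TLS but does not show up in this list is what prompts question (1).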

(2) More generally, is there an overview of how UCX chooses which transport to use for its critical-path operations such as ucp_tag_send_nb?

Steps to Reproduce

  • Command line: mpiexec -n 2 -ppn 1 -hosts <node1>,<node2> -env UCX_NET_DEVICES mlx5_0:1 -env UCX_TLS rc_mlx5,rc_verbs ./osu_mbw_mr
  • UCX version used: master @ eaad8e2
  • UCX configure flags: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt
  • MPICH/CH4/UCX @ d1e673a
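To compare the UCX_TLS settings mentioned above side by side, a dry-run script along these lines could generate one command line per setting. It is a sketch under assumptions: it uses the -env NAME VALUE form for mpiexec, node1 and node2 are placeholder host names, and NODE1, NODE2 and build_cmd are hypothetical names introduced here. It only prints the commands and does not run them.

```shell
#!/bin/sh
# Dry-run sketch: print one mpiexec command per UCX_TLS value discussed above.
# Assumptions: mpiexec accepts "-env NAME VALUE"; node1/node2 are placeholders
# that can be overridden via the hypothetical NODE1/NODE2 variables.
build_cmd() {
    tls="$1"
    printf 'mpiexec -n 2 -ppn 1 -hosts %s,%s -env UCX_NET_DEVICES mlx5_0:1 -env UCX_TLS %s ./osu_mbw_mr\n' \
        "${NODE1:-node1}" "${NODE2:-node2}" "$tls"
}

# The failing combination first, then the ones reported to work.
for tls in rc_mlx5,rc_verbs rc_mlx5,rc rc_mlx5,ud_mlx5 rc_mlx5,ud_verbs; do
    build_cmd "$tls"
done
```

Piping the output to sh on a machine with the same setup would run each case in turn.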

Setup and versions

  • [rzambre@hpc3-14-12:~] $cat /etc/system-release
    CentOS Linux release 7.7.1908 (Core)
  • [rzambre@hpc3-14-12:~] $uname -a
    Linux hpc3-14-12 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Thanks! Wasn't aware of the new documentation. That is helpful.