openucx/ucx

Setting UCX_NET_DEVICES to only TCP devices appears to be ignored when RoCE is available

bertiethorpe opened this issue · 8 comments

Describe the bug

Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.

I'm running a 2 node IMB_MPI PingPong to benchmark RoCE against regular TCP ethernet.

Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only slightly longer latency.

As per the docs, when setting UCX_NET_DEVICES to one of the TCP devices I should expect TCP-like latencies of ~15us, but I am seeing something much closer to RoCE performance, with latencies of ~2.1us.

Stranger still, the latency when specifically targeting mlx5_0:1 or all is different (lower, ~1.6us), so it looks like the fallback is not all when setting eth0 etc.

Is this behaviour determined somewhere else or accounted for in some way?
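For what it's worth, a minimal sketch of how I could pin the TCP-only case down further (UCX_TLS is an additional restriction that was not set in the runs described above, so this goes beyond my original setup):

# Restrict UCX to TCP (plus shared memory and self) in addition to pinning the device
export UCX_NET_DEVICES=eth0
export UCX_TLS=tcp,sm,self
mpirun IMB-MPI1 pingpong -iter_policy off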

Steps to Reproduce

  • Batch Script:
#!/usr/bin/env bash

#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard

module load gnu12 openmpi4 imb

export UCX_NET_DEVICES=mlx5_0:1

echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES

export UCX_LOG_LEVEL=data
#srun --mpi=pmi2 IMB-MPI1 pingpong # doesn't work in ohpc v2.1
mpirun IMB-MPI1 pingpong -iter_policy off
  • UCX version 1.17.0
  • Git branch '', revision 7bb2722
    Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --without-mad --without-ze
  • Any UCX environment variables used
    • See logs

Setup and versions

  • OS version (e.g Linux distro)
    • Rocky Linux release 9.4 (Blue Onyx)
  • Driver version:
    • rdma-core-2404mlnx51-1.2404066.x86_64
    • MLNX_OFED_LINUX-24.04-0.6.6.0
  • HW information from ibstat or ibv_devinfo -vv command
    •  transport:                      InfiniBand (0)
       fw_ver:                         20.36.1010
       node_guid:                      fa16:3eff:fe4f:f5e9
       sys_image_guid:                 0c42:a103:0003:5d82
       vendor_id:                      0x02c9
       vendor_part_id:                 4124
       hw_ver:                         0x0
       board_id:                       MT_0000000224
       phys_port_cnt:                  1
               port:   1
                       state:                  PORT_ACTIVE (4)
                       max_mtu:                4096 (5)
                       active_mtu:             1024 (3)
                       sm_lid:                 0
                       port_lid:               0
                       port_lmc:               0x00
                       link_layer:             Ethernet
      
      

Additional information (depending on the issue)

  • OpenMPI version 4.1.5

Logs:

Hi @bertiethorpe
In the attached eth0.txt log file there's no evidence of UCX connection establishment; the environment variable UCX_NET_DEVICES is also not propagated to the config parser, unlike in the mlxlog.txt file.

Therefore we suggest:

  1. Please double-check the command line for both cases and ensure UCX is used.
  2. Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES and check whether the used devices are the ones you expect.
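For example, a quick sketch of step 2 using the device names from this issue (only the lane lines are of interest):

for dev in mlx5_0:1 eth0 eth1; do
    echo "=== UCX_NET_DEVICES=$dev ==="
    UCX_NET_DEVICES=$dev ucx_info -e -u t -P inter | grep 'lane\['
done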

@bertiethorpe can you please run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Also, what were the configure flags for OpenMPI?
It seems OpenMPI is not using the UCX component when UCX_NET_DEVICES=eth0, due to the higher priority of OpenMPI's btl/openib component, which also uses RDMA.
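For example, a sketch of how the selected PML can be checked (the exact wording of the verbose output may differ between OpenMPI versions):

# Without forcing "-mca pml ucx", the verbose output shows which PML wins the selection
mpirun -mca pml_base_verbose 99 IMB-MPI1 pingpong 2>&1 | grep -i pml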

Some more information:

  • This is all virtualised

Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES and check whether the used devices are the ones you expect.

ucx_info -e -u t -P inter
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  8:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0
#                 lane[1]:  3:tcp/eth1.0 md[1]              -> md[1]/tcp/sysdev[255] rma_bw#1 wireup
#
#                tag_send: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..227..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
#
#                  rma_bw: mds [1] [4] #
#                     rma: mds rndv_rkey_size 19
#
UCX_NET_DEVICES=eth0 ucx_info -e -u t -P inter
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  1:tcp/eth0.0 md[1]              -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
#                tag_send: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
#
#                  rma_bw: mds [1] #
#                     rma: mds rndv_rkey_size 10
#
UCX_NET_DEVICES=eth1 ucx_info -e -u t -P inter
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  1:tcp/eth1.0 md[1]              -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
#                tag_send: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
#
#                  rma_bw: mds [1] #
#                     rma: mds rndv_rkey_size 10
#

Are these expected? I would expect the mlx device to pair with eth1, because they're on the same NIC.
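One way to check the mapping on my side (ibdev2netdev ships with MLNX_OFED; the sysfs path is an alternative if that script is not installed):

# Map RDMA devices to their Ethernet interfaces
ibdev2netdev
ls /sys/class/infiniband/mlx5_0/device/net/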

can you please run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?

ucxlog.txt

Can you please configure OpenMPI with --with-platform=contrib/platform/mellanox/optimized?
It will force using UCX also with TCP transports.
Alternatively, you can add -mca pml_ucx_tls any -mca pml_ucx_devices any to mpirun.
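For example, the variant without rebuilding OpenMPI would be something along these lines (only the two pml_ucx_* options are the suggested change; the rest matches the commands already used in this issue):

mpirun -mca pml ucx -mca pml_ucx_tls any -mca pml_ucx_devices any IMB-MPI1 pingpong -iter_policy off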

ucxlog2.txt

So that seems to have done the trick. Now getting the latency I expected.

It seems OpenMPI is not using the UCX component when UCX_NET_DEVICES=eth0, due to the higher priority of OpenMPI's btl/openib component, which also uses RDMA.

Where can you see this in the logs? Forgive my ignorance, but I can't see that the btl openib component is available at all. Was it removed in v4.1.x?

ompi_info |  grep btl
                 MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)

This is all I see.