UCX ignores exclusively setting TCP devices when RoCE is available.
bertiethorpe opened this issue · 8 comments
Describe the bug
Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.
I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP Ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only a slightly longer latency.
As per the docs, when setting UCX_NET_DEVICES to one of the TCP devices I should expect TCP-like latencies of ~15us, but I am seeing close-to-RoCE performance with latencies of ~2.1us.
Stranger still, the latency when specifically targeting mlx5_0:1 or all is different (lower, ~1.6us), so the fallback when setting eth0 etc. does not appear to be all.
Is this behaviour determined somewhere else or accounted for in some way?
Steps to Reproduce
- Batch Script:
#!/usr/bin/env bash
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard
module load gnu12 openmpi4 imb
export UCX_NET_DEVICES=mlx5_0:1
echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES
export UCX_LOG_LEVEL=data
#srun --mpi=pmi2 IMB-MPI1 pingpong # doesn't work in ohpc v2.1
mpirun IMB-MPI1 pingpong -iter_policy off
- UCX version 1.17.0
- Git branch '', revision 7bb2722
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --without-mad --without-ze
- UCX environment variables used: see logs
Setup and versions
- OS version: Rocky Linux release 9.4 (Blue Onyx)
- Driver version:
- rdma-core-2404mlnx51-1.2404066.x86_64
- MLNX_OFED_LINUX-24.04-0.6.6.0
- HW information from ibstat or ibv_devinfo -vv:
  transport: InfiniBand (0)
  fw_ver: 20.36.1010
  node_guid: fa16:3eff:fe4f:f5e9
  sys_image_guid: 0c42:a103:0003:5d82
  vendor_id: 0x02c9
  vendor_part_id: 4124
  hw_ver: 0x0
  board_id: MT_0000000224
  phys_port_cnt: 1
  port: 1
  state: PORT_ACTIVE (4)
  max_mtu: 4096 (5)
  active_mtu: 1024 (3)
  sm_lid: 0
  port_lid: 0
  port_lmc: 0x00
  link_layer: Ethernet
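A note on the output above: the "link_layer: Ethernet" field is what marks this mlx5_0 port as RoCE (RDMA over an Ethernet link layer) rather than native InfiniBand, even though the transport reads "InfiniBand". A minimal sketch of that check (the link_layer value is pasted from the output above; on a live node it would come from `ibstat mlx5_0 | grep link_layer`):

```shell
# Sample field copied from the ibstat output in this issue.
link_layer='link_layer: Ethernet'
case "$link_layer" in
  *Ethernet*)   kind='RoCE (RDMA over an Ethernet link layer)' ;;
  *InfiniBand*) kind='native InfiniBand' ;;
  *)            kind='unknown' ;;
esac
echo "$kind"
# → RoCE (RDMA over an Ethernet link layer)
```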
Additional information (depending on the issue)
- OpenMPI version 4.1.5
Logs:
Hi @bertiethorpe
In the attached eth0.txt log file there's no evidence of UCX connection establishment, and the environment variable UCX_NET_DEVICES is not propagated to the config parser - unlike in the mlxlog.txt file.
Therefore we suggest:
- Please double-check the command line for both cases and ensure UCX is used.
- Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES settings and check whether the devices used are the ones you expect.
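That sweep can be scripted, for example as below (a hedged sketch: it assumes the ucx_info binary from this UCX install is on PATH on a node that has the RoCE device, and degrades gracefully where it is not):

```shell
# Print only the lane lines for each candidate device restriction;
# the lane lines name the transport/device actually chosen by UCX.
if command -v ucx_info >/dev/null 2>&1; then
  for dev in all mlx5_0:1 eth0 eth1; do
    printf '== UCX_NET_DEVICES=%s ==\n' "$dev"
    UCX_NET_DEVICES="$dev" ucx_info -e -u t -P inter 2>/dev/null \
      | grep -F 'lane[' || true
  done
  checked=yes
else
  echo 'ucx_info not found; run this on the cluster nodes'
  checked=no
fi
```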
@bertiethorpe can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Also, what were the configure flags for OpenMPI?
It seems OpenMPI is not using the UCX component when UCX_NET_DEVICES=eth0, due to the higher priority of OpenMPI's btl/openib component, which also uses RDMA.
Some more information:
- This is all virtualised
Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES and check whether the used devices are the ones you expect.
ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 8:rc_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0
# lane[1]: 3:tcp/eth1.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#1 wireup
#
# tag_send: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..227..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
#
# rma_bw: mds [1] [4] #
# rma: mds rndv_rkey_size 19
#
UCX_NET_DEVICES=eth0 ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 1:tcp/eth0.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
# tag_send: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
#
# rma_bw: mds [1] #
# rma: mds rndv_rkey_size 10
#
UCX_NET_DEVICES=eth1 ucx_info -e -u t -P inter
#
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 1:tcp/eth1.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
# tag_send: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
# tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
# tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
#
# rma_bw: mds [1] #
# rma: mds rndv_rkey_size 10
#
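Reading the outputs above: each "lane" line names the transport and device UCX selected for that endpoint. A small parser (a sketch in plain sed; parse_lane is a helper name introduced here, not a UCX tool) pulls those fields out of the lane lines quoted above:

```shell
# Extract lane number, transport, and device from a `ucx_info -e` lane line.
parse_lane() {
  sed -n 's|.*lane\[\([0-9]*\)\]: *[0-9]*:\([a-z0-9_]*\)/\(.*\)\.[0-9]* .*|lane \1: transport=\2 device=\3|p'
}
echo '# lane[0]: 8:rc_mlx5/mlx5_0:1.0 md[4] -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0' | parse_lane
# → lane 0: transport=rc_mlx5 device=mlx5_0:1
echo '# lane[0]: 1:tcp/eth0.0 md[1] -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup' | parse_lane
# → lane 0: transport=tcp device=eth0
```

So in these ucx_info runs UCX itself honours the restriction: with eth0 or eth1 the only lane is tcp on the requested device, and the rc_mlx5 lane appears only in the unrestricted case.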
Are these expected? I would expect the mlx5 device to go with eth1, because they're on the same NIC.
can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Can you pls configure OpenMPI with --with-platform=contrib/platform/mellanox/optimized? It will force using UCX also with TCP transports.
Alternatively, you can add -mca pml_ucx_tls any -mca pml_ucx_devices any to mpirun.
So that seems to have done the trick. Now getting the latency I expected.
It seems OpenMPI is not using UCX component when UCX_NET_DEVICES=eth0, due to a higher priority of OpenMPI's btl/openib component, which is also using RDMA.
Where can you see this in the logs? Forgive my ignorance, but I can't actually see the btl/openib component listed as available at all. Was it removed in v4.1.x?
ompi_info | grep btl
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
This is all I see.