Long-Tail Requests
Clownier opened this issue · 2 comments
Describe the bug
We are using the RDMA transport provided by UCX on a RoCE v2 network and have observed 1-2 second long-tail requests. The frequency of these long-tail requests correlates positively with system load and with the size of the requested IOs. Additionally, we have noticed that the Ethernet ports' tx_pause counters and bond1's ECN and CNP counters keep increasing even when traffic volume is relatively low (under 500MB per machine). Is this behavior normal?
Our business model is depicted in the attached diagram, where we have encapsulated the Server and Client ends using UCX. The primary APIs used are ucp_tag_send_nb and ucp_tag_recv_nb. Each Server end establishes connections with multiple Client ends. The long-tail issue primarily occurs between the Middle and Tail processes, both of which use the rc_x transport mode.
During the request-response interaction between the Client and Server ends facilitated by UCX's TagMatch, the Client initially sends a request header with a specific Tag to the Server, followed by the corresponding data field for that Tag. Upon receiving the request header, the Server parses the Tag and initiates data reception for that Tag. The Server then sends a Response back to the Client, also in the form of a request header. The request headers utilize a fixed Tag matching scheme, with the first bit set to 1.
After troubleshooting, we have identified that the primary cause of the long-tail issue lies in the Server's inability to receive data. For a duration of approximately 2 seconds, the Server's Progress function can only receive new request headers but fails to receive the data or complete the sending of the Response (i.e., the send callback is not invoked).
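For context, below is a minimal, simplified sketch of the request flow described above (not our production code): the Client sends a fixed-tag request header followed by the payload under a per-request tag, and the Server, after parsing the header, posts the matching data receive. HEADER_TAG, the buffers, and the helper function names are placeholders, and the "first bit set to 1" is assumed to mean the tag's most significant bit.

#include <ucp/api/ucp.h>

/* Fixed tag for request headers ("first bit set to 1"); MSB assumed here. */
#define HEADER_TAG   ((ucp_tag_t)1 << 63)
#define TAG_MASK_ALL ((ucp_tag_t)-1)

static void send_cb(void *request, ucs_status_t status)
{ (void)request; (void)status; }

static void recv_cb(void *request, ucs_status_t status, ucp_tag_recv_info_t *info)
{ (void)request; (void)status; (void)info; }

/* Client side: send the request header first, then the data under its own tag.
 * Requests are released immediately in this sketch; the real code waits for
 * the send callback before considering the operation complete. */
static void client_send_request(ucp_ep_h ep, const void *hdr, size_t hdr_len,
                                const void *data, size_t data_len, ucp_tag_t req_tag)
{
    ucs_status_ptr_t req;

    req = ucp_tag_send_nb(ep, hdr, hdr_len, ucp_dt_make_contig(1), HEADER_TAG, send_cb);
    if (UCS_PTR_IS_PTR(req)) ucp_request_free(req);

    req = ucp_tag_send_nb(ep, data, data_len, ucp_dt_make_contig(1), req_tag, send_cb);
    if (UCS_PTR_IS_PTR(req)) ucp_request_free(req);
}

/* Server side: after parsing the received header, post the receive for that tag's data. */
static void server_post_data_recv(ucp_worker_h worker, void *buf, size_t len, ucp_tag_t req_tag)
{
    ucs_status_ptr_t req = ucp_tag_recv_nb(worker, buf, len, ucp_dt_make_contig(1),
                                           req_tag, TAG_MASK_ALL, recv_cb);
    if (UCS_PTR_IS_ERR(req)) { /* handle error */ }
}

The Response is then sent back by the Server with the same header-tag scheme via ucp_tag_send_nb.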
Steps to Reproduce
Command line
UCX version used: v1.12.0
UCX configure flags
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
Any UCX environment variables used
setenv("UCX_MAX_EAGER_LANES", "2", 1);
setenv("UCX_IB_SEG_SIZE", "2k", 1);
setenv("UCX_RC_RX_QUEUE_LEN", "1024", 1);
setenv("UCX_RC_MAX_RD_ATOMIC", "16", 1);
setenv("UCX_RC_ROCE_PATH_FACTOR", "2", 1);
setenv("UCX_RNDV_THRESH", "32k", 1);
setenv("UCX_IB_TRAFFIC_CLASS", "166", 1);
setenv("UCX_SOCKADDR_CM_ENABLE", "y", 1);
setenv("UCX_RC_MAX_GET_ZCOPY", "32k", 1);
setenv("UCX_RC_TX_NUM_GET_OPS", "8", 1);
setenv("UCX_RC_TX_NUM_GET_BYTES", "256k", 1);
setenv("UCX_RC_TX_CQ_MODERATION", "1", 1);
setenv("UCX_HANDLE_ERRORS", "", 1);
setenv("UCX_IB_FORK_INIT", "n", 1);
Setup and versions
- OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
CentOS Linux release 7.2.1511 (Core) 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
- For RDMA/IB/RoCE related issues:
- Driver version:
- rdma-core-52mlnx1-1.52104.x86_64
- MLNX_OFED_LINUX-5.2-1.0.4.0:
hca_id: mlx5_bond_0
    transport: InfiniBand (0)
    fw_ver: 16.27.1016
    node_guid: b8ce:f603:00e9:4d5e
    sys_image_guid: b8ce:f603:00e9:4d5e
    vendor_id: 0x02c9
    vendor_part_id: 4119
    hw_ver: 0x0
    board_id: MT_0000000080
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xed721c36 BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B RAW_IP_CSUM MANAGED_FLOW_STEERING Unknown flags: 0xC8400000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    raw packet caps: C-VLAN stripping offload, Scatter FCS offload, IP csum offload, Delay drop
    device_cap_flags_ex: 0x30000055ED721C36 RAW_SCATTER_FCS PCI_WRITE_END_PADDING Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 262144
        supported_qp: SUPPORT_RAW_PACKET
    rss_caps:
        max_rwq_indirection_tables: 65536
        max_rwq_indirection_table_size: 2048
        rx_hash_function: 0x1
        rx_hash_fields_mask: 0x800000FF
        supported_qp: SUPPORT_RAW_PACKET
    max_wq_type_rq: 8388608
    packet_pacing_caps:
        qp_rate_limit_min: 1kbps
        qp_rate_limit_max: 25000000kbps
        supported_qp: SUPPORT_RAW_PACKET
    tag matching not supported
    cq moderation caps:
        max_cq_count: 65535
        max_cq_period: 4095 us
    maximum available device memory: 131072Bytes
    port: 1
        state: PORT_ACTIVE (4)
        max_mtu: 4096 (5)
        active_mtu: 1024 (3)
        sm_lid: 0
        port_lid: 0
        port_lmc: 0x00
        link_layer: Ethernet
        max_msg_sz: 0x40000000
        port_cap_flags: 0x04010000
        port_cap_flags2: 0x0000
        max_vl_num: invalid value (0)
        bad_pkey_cntr: 0x0
        qkey_viol_cntr: 0x0
        sm_sl: 0
        pkey_tbl_len: 1
        gid_tbl_len: 256
        subnet_timeout: 0
        init_type_reply: 0
        active_width: 1X (1)
        active_speed: 25.0 Gbps (32)
        phys_state: LINK_UP (5)
        GID[ 0]: fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
        GID[ 1]: fe80::bace:f6ff:fee9:4d5e, RoCE v2
        GID[ 2]: fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
        GID[ 3]: fe80::bace:f6ff:fee9:4d5e, RoCE v2
        GID[ 4]: 0000:0000:0000:0000:0000:ffff:0a4e:0588, RoCE v1
        GID[ 5]: ::ffff:10.78.5.136, RoCE v2
Additional information (depending on the issue)
- Output of ucx_info -d to show transports and devices recognized by UCX
#
# Memory domain: posix
# Component: posix
# allocate: <= 64967552K
# remote key: 24 bytes
# rkey_ptr is supported
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: self
# Device: memory0
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: bond1
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 5658.18/ppn + 0.00 MB/sec
# latency: 5212 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: mlx5_bond_0
# Component: ib
# register: unlimited, cost: 180 nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
#
# Transport: rc_verbs
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 123
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 127
# domain: device
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 38
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 5 bytes
# error handling: peer failure, ep_check
#
#
# Transport: rc_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 14 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 14 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 186
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to ep
# device priority: 38
# device num paths: 2
# max eps: 256
# device address: 18 bytes
# ep address: 7 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: dc_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 860 nsec
# overhead: 40 nsec
# put_short: <= 2K
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 11 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 11 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 2046
# am_bcopy: <= 8254
# am_zcopy: <= 8254, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 138
# domain: device
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 38
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
# Transport: ud_verbs
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 880
# connection: to ep, to iface
# device priority: 38
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Transport: ud_mlx5
# Device: mlx5_bond_0:1
# Type: network
# System device: mlx5_bond_0 (0)
#
# capabilities:
# bandwidth: 2739.46/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 132
# connection: to ep, to iface
# device priority: 38
# device num paths: 2
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: knem
# Component: knem
# register: unlimited, cost: 18446744073709551616000000000 nsec
# remote key: 16 bytes
#
# Transport: knem
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 13862.00/ppn + 0.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 0 bytes
# error handling: none
#
@Clownier can you pls try the following:

- Remove these env vars:
  setenv("UCX_RC_MAX_GET_ZCOPY", "32k", 1);
  setenv("UCX_RC_TX_NUM_GET_OPS", "8", 1);
- Check if the server application spends a lot of time in system calls when the long-tail delays happen (for example, using strace or top).
- Try
  setenv("UCX_RNDV_THRESH", "32k", 1);
  and
  setenv("UCX_RNDV_THRESH", "inf", 1);
  (separately)
Thank you for your advice.
First, let me clarify the question a bit further. Our business request sizes typically range from 4KB to 64KB, with occasional requests reaching 256KB. During testing with different IO sizes, we observed that long tails mainly occur with requests above 32KB. At the same time, larger IO sizes also drive up bandwidth, which makes it difficult for us to isolate whether the issue is caused by the IO size itself or by the increased bandwidth.
Based on this, we conducted tests following your suggestions:

- After removing the UCX_RC_MAX_GET_ZCOPY and UCX_RC_TX_NUM_GET_OPS configurations, we did not observe any significant changes.
- We used strace to monitor the system calls on the thread that runs ucp_worker_progress when long tails occur.
  - When there are no long tails, the strace output is:
...
11:02:17.115288 futex(0x7f3e28029968, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
11:02:17.115346 futex(0x7f3e28029968, FUTEX_WAKE_PRIVATE, 1) = 0
11:02:18.098154 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2765, 891513}, ru_stime={0, 363269}, ...}) = 0
11:02:19.098167 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2766, 892186}, ru_stime={0, 363269}, ...}) = 0
11:02:19.117171 futex(0x7f3e28029968, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
11:02:19.117260 futex(0x7f3e28029968, FUTEX_WAKE_PRIVATE, 1) = 0
11:02:20.098184 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2767, 892739}, ru_stime={0, 363269}, ...}) = 0
11:02:21.098205 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2768, 893430}, ru_stime={0, 363269}, ...}) = 0
...
  - When long tails occur, the strace output is:
...
11:02:17.003769 gettid() = 12123
11:02:17.003824 clock_gettime(CLOCK_REALTIME, {1720494137, 3838477}) = 0
11:02:17.003883 futex(0x314b714, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x314b710, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
11:02:17.003943 gettid() = 12123
11:02:17.003995 clock_gettime(CLOCK_REALTIME, {1720494137, 4009639}) = 0
11:02:17.004048 futex(0x314b714, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x314b710, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
11:02:17.004105 gettid() = 12123
...
  - Additionally, we haven't found a way to observe system calls using top. Could you please provide a relevant reference link for this purpose?
- We conducted comparative tests with UCX_RNDV_THRESH set to 4K, 32K, and inf. Compared to the original 32K setting, 4K brought a significant bandwidth improvement for 4K and 16K IO sizes, but with reduced stability; for IO sizes above 32K there was no noticeable change. With UCX_RNDV_THRESH set to inf, we observed a 30% drop in bandwidth and increased jitter, along with some improvement in the second-level long tails (which we attribute to the decreased bandwidth). However, the bandwidth drop is unacceptable for us.
- After reading "Collie: Finding Performance Anomalies in RDMA Subsystems," we found that Root Cause 6, "RDMA NIC has potential in-NIC incast/congestion," aligns with our observations. So we temporarily blocked the data transmission between the two modules on the same machine at the business level (i.e., loopback traffic) and noticed a significant reduction in long tails with no change in bandwidth. This leads us to four new questions:
  a. Is there a way to disable RDMA loopback traffic, or to force loopback traffic to go through a switch?
  b. Are there any other solutions to address the increased pause frames and resulting long tails caused by loopback traffic?
  c. Does UCX internally utilize IPC mechanisms such as UNIX sockets for inter-process communication, bypassing the network altogether?
  d. After blocking loopback traffic, we still encounter a small number of long tails. What other suggestions can we try?