openucx/ucx

Long-Tail Requests

Clownier opened this issue · 2 comments

Describe the bug

We use the RDMA transport provided by UCX over a RoCE v2 network and observe long-tail requests lasting 1 to 2 seconds. Their frequency correlates positively with system load and with the size of the requested IOs. We have also noticed that the eth tx_pause counter and bond1's ecn and cnp counters keep increasing even when the traffic volume is relatively low (under 500MB per machine). Is this behavior normal?

image
Our business model is depicted in the attached diagram, where we have encapsulated the Server and Client ends using UCX. The primary APIs used are ucp_tag_send_nb and ucp_tag_recv_nb. Each Server end establishes connections with multiple Client ends. The long-tail issue primarily occurs between the Middle and Tail processes, both of which use the rc_x transport mode.

During the request-response interaction between the Client and Server ends facilitated by UCX's TagMatch, the Client initially sends a request header with a specific Tag to the Server, followed by the corresponding data field for that Tag. Upon receiving the request header, the Server parses the Tag and initiates data reception for that Tag. The Server then sends a Response back to the Client, also in the form of a request header. The request headers utilize a fixed Tag matching scheme, with the first bit set to 1.
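
The tag layout described above could be sketched roughly as follows. This is an illustration only, not the actual application code: it assumes "first bit" means the most significant bit of the 64-bit tag, that the remaining bits carry a request ID, and that matching compares only the bits selected by the receive mask. The names HEADER_BIT, make_header_tag, make_data_tag and tag_matches are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the "first bit" is the most significant bit of the 64-bit tag. */
#define HEADER_BIT (UINT64_C(1) << 63)
#define FULL_MASK  UINT64_MAX

/* Tag for a request/response header: top bit set, low 63 bits = request ID. */
static inline uint64_t make_header_tag(uint64_t request_id)
{
    return HEADER_BIT | (request_id & ~HEADER_BIT);
}

/* Tag for the data that follows a header: same request ID, top bit clear. */
static inline uint64_t make_data_tag(uint64_t request_id)
{
    return request_id & ~HEADER_BIT;
}

static inline bool is_header_tag(uint64_t tag)
{
    return (tag & HEADER_BIT) != 0;
}

/* Matching rule: only the bits selected by mask are compared. */
static inline bool tag_matches(uint64_t expected, uint64_t mask, uint64_t incoming)
{
    return ((expected ^ incoming) & mask) == 0;
}

/* Small self-check showing how the two receive types would be posted. */
static inline void tag_scheme_self_check(void)
{
    /* Header receive: tag = HEADER_BIT, mask = HEADER_BIT, so any header matches. */
    assert(tag_matches(HEADER_BIT, HEADER_BIT, make_header_tag(7)));
    /* Data receive: exact tag with a full mask, so only request 7's data matches. */
    assert(tag_matches(make_data_tag(7), FULL_MASK, make_data_tag(7)));
    assert(!tag_matches(make_data_tag(7), FULL_MASK, make_data_tag(8)));
}
```

With this layout the Server keeps one wildcard receive posted for headers (mask HEADER_BIT) and posts an exact-match receive for the data only after it has parsed the header.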

After troubleshooting, we have identified that the primary cause of the long-tail issue lies in the Server's inability to receive data. For a duration of approximately 2 seconds, the Server's Progress function can only receive new request headers but fails to receive the data or complete the sending of the Response (i.e., the send callback is not invoked).
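
One way to pin down such stalls is to track how long operations have been outstanding without any completion callback firing. The sketch below is a minimal, UCX-independent illustration: stall_tracker_t and its functions are hypothetical names, and the caller is assumed to supply monotonic timestamps in nanoseconds (for example from clock_gettime(CLOCK_MONOTONIC)) when posting a receive or send and from inside the completion callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t threshold_ns;     /* gap without completions that counts as a stall */
    uint64_t last_progress_ns; /* time of the last completion (or first post)    */
    uint64_t outstanding;      /* operations posted but not yet completed        */
} stall_tracker_t;

static inline void stall_tracker_init(stall_tracker_t *t, uint64_t threshold_ns)
{
    assert(t != NULL);
    t->threshold_ns     = threshold_ns;
    t->last_progress_ns = 0;
    t->outstanding      = 0;
}

/* Call when a data receive or response send is posted. */
static inline void stall_tracker_post(stall_tracker_t *t, uint64_t now_ns)
{
    if (t->outstanding == 0) {
        /* Leaving the idle state: do not count idle time as a stall. */
        t->last_progress_ns = now_ns;
    }
    t->outstanding++;
}

/* Call from the receive/send completion callback. */
static inline void stall_tracker_complete(stall_tracker_t *t, uint64_t now_ns)
{
    assert(t->outstanding > 0);
    t->outstanding--;
    t->last_progress_ns = now_ns;
}

/* Call from the progress loop; true if work is pending but nothing completed
 * for at least threshold_ns. */
static inline bool stall_tracker_is_stalled(const stall_tracker_t *t, uint64_t now_ns)
{
    return t->outstanding > 0 &&
           (now_ns - t->last_progress_ns) >= t->threshold_ns;
}
```

Logging the moment stall_tracker_is_stalled() first returns true makes it easier to line up the stall window with the NIC pause/ECN/CNP counters and with strace output.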

Steps to Reproduce

Command line
UCX version used: v1.12.0
UCX configure flags
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --without-xpmem --without-fuse3 --without-ugni
Any UCX environment variables used

    setenv("UCX_MAX_EAGER_LANES", "2", 1);
    setenv("UCX_IB_SEG_SIZE", "2k", 1);
    setenv("UCX_RC_RX_QUEUE_LEN", "1024", 1);
    setenv("UCX_RC_MAX_RD_ATOMIC", "16", 1);
    setenv("UCX_RC_ROCE_PATH_FACTOR", "2", 1);
    setenv("UCX_RNDV_THRESH", "32k", 1);
    setenv("UCX_IB_TRAFFIC_CLASS", "166", 1);
    setenv("UCX_SOCKADDR_CM_ENABLE", "y", 1);
    setenv("UCX_RC_MAX_GET_ZCOPY", "32k", 1);
    setenv("UCX_RC_TX_NUM_GET_OPS", "8", 1);
    setenv("UCX_RC_TX_NUM_GET_BYTES", "256k", 1);
    setenv("UCX_RC_TX_CQ_MODERATION", "1", 1);
    setenv("UCX_HANDLE_ERRORS", "", 1);
    setenv("UCX_IB_FORK_INIT", "n", 1);

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    CentOS Linux release 7.2.1511 (Core) 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  • For RDMA/IB/RoCE related issues:
    • Driver version:
    • rdma-core-52mlnx1-1.52104.x86_64
    • MLNX_OFED_LINUX-5.2-1.0.4.0:
          hca_id: mlx5_bond_0
          transport:                      InfiniBand (0)
          fw_ver:                         16.27.1016
          node_guid:                      b8ce:f603:00e9:4d5e
          sys_image_guid:                 b8ce:f603:00e9:4d5e
          vendor_id:                      0x02c9
          vendor_part_id:                 4119
          hw_ver:                         0x0
          board_id:                       MT_0000000080
          phys_port_cnt:                  1
          max_mr_size:                    0xffffffffffffffff
          page_size_cap:                  0xfffffffffffff000
          max_qp:                         262144
          max_qp_wr:                      32768
          device_cap_flags:               0xed721c36
                                          BAD_PKEY_CNTR
                                          BAD_QKEY_CNTR
                                          AUTO_PATH_MIG
                                          CHANGE_PHY_PORT
                                          PORT_ACTIVE_EVENT
                                          SYS_IMAGE_GUID
                                          RC_RNR_NAK_GEN
                                          MEM_WINDOW
                                          XRC
                                          MEM_MGT_EXTENSIONS
                                          MEM_WINDOW_TYPE_2B
                                          RAW_IP_CSUM
                                          MANAGED_FLOW_STEERING
                                          Unknown flags: 0xC8400000
          max_sge:                        30
          max_sge_rd:                     30
          max_cq:                         16777216
          max_cqe:                        4194303
          max_mr:                         16777216
          max_pd:                         16777216
          max_qp_rd_atom:                 16
          max_ee_rd_atom:                 0
          max_res_rd_atom:                4194304
          max_qp_init_rd_atom:            16
          max_ee_init_rd_atom:            0
          atomic_cap:                     ATOMIC_HCA (1)
          max_ee:                         0
          max_rdd:                        0
          max_mw:                         16777216
          max_raw_ipv6_qp:                0
          max_raw_ethy_qp:                0
          max_mcast_grp:                  2097152
          max_mcast_qp_attach:            240
          max_total_mcast_qp_attach:      503316480
          max_ah:                         2147483647
          max_fmr:                        0
          max_srq:                        8388608
          max_srq_wr:                     32767
          max_srq_sge:                    31
          max_pkeys:                      128
          local_ca_ack_delay:             16
          general_odp_caps:
                                          ODP_SUPPORT
                                          ODP_SUPPORT_IMPLICIT
          rc_odp_caps:
                                          SUPPORT_SEND
                                          SUPPORT_RECV
                                          SUPPORT_WRITE
                                          SUPPORT_READ
                                          SUPPORT_SRQ
          uc_odp_caps:
                                          NO SUPPORT
          ud_odp_caps:
                                          SUPPORT_SEND
          xrc_odp_caps:
                                          SUPPORT_SEND
                                          SUPPORT_WRITE
                                          SUPPORT_READ
                                          SUPPORT_SRQ
          completion timestamp_mask:                      0x7fffffffffffffff
          hca_core_clock:                 156250kHZ
          raw packet caps:
                                          C-VLAN stripping offload
                                          Scatter FCS offload
                                          IP csum offload
                                          Delay drop
          device_cap_flags_ex:            0x30000055ED721C36
                                          RAW_SCATTER_FCS
                                          PCI_WRITE_END_PADDING
                                          Unknown flags: 0x3000004100000000
          tso_caps:
                  max_tso:                        262144
                  supported_qp:
                                          SUPPORT_RAW_PACKET
          rss_caps:
                  max_rwq_indirection_tables:                     65536
                  max_rwq_indirection_table_size:                 2048
                  rx_hash_function:                               0x1
                  rx_hash_fields_mask:                            0x800000FF
                  supported_qp:
                                          SUPPORT_RAW_PACKET
          max_wq_type_rq:                 8388608
          packet_pacing_caps:
                  qp_rate_limit_min:      1kbps
                  qp_rate_limit_max:      25000000kbps
                  supported_qp:
                                          SUPPORT_RAW_PACKET
          tag matching not supported
    
          cq moderation caps:
                  max_cq_count:   65535
                  max_cq_period:  4095 us
    
          maximum available device memory:        131072Bytes
    
                  port:   1
                          state:                  PORT_ACTIVE (4)
                          max_mtu:                4096 (5)
                          active_mtu:             1024 (3)
                          sm_lid:                 0
                          port_lid:               0
                          port_lmc:               0x00
                          link_layer:             Ethernet
                          max_msg_sz:             0x40000000
                          port_cap_flags:         0x04010000
                          port_cap_flags2:        0x0000
                          max_vl_num:             invalid value (0)
                          bad_pkey_cntr:          0x0
                          qkey_viol_cntr:         0x0
                          sm_sl:                  0
                          pkey_tbl_len:           1
                          gid_tbl_len:            256
                          subnet_timeout:         0
                          init_type_reply:        0
                          active_width:           1X (1)
                          active_speed:           25.0 Gbps (32)
                          phys_state:             LINK_UP (5)
                          GID[  0]:               fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
                          GID[  1]:               fe80::bace:f6ff:fee9:4d5e, RoCE v2
                          GID[  2]:               fe80:0000:0000:0000:bace:f6ff:fee9:4d5e, RoCE v1
                          GID[  3]:               fe80::bace:f6ff:fee9:4d5e, RoCE v2
                          GID[  4]:               0000:0000:0000:0000:0000:ffff:0a4e:0588, RoCE v1
                          GID[  5]:               ::ffff:10.78.5.136, RoCE v2
    
    

Additional information (depending on the issue)

  • Output of ucx_info -d to show transports and devices recognized by UCX
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 64967552K
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: bond1
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 5658.18/ppn + 0.00 MB/sec
#              latency: 5212 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: mlx5_bond_0
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#      Transport: rc_verbs
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 2
#              max eps: 256
#       device address: 18 bytes
#           ep address: 5 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 2
#              max eps: 256
#       device address: 18 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 860 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 1K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 1K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 38
#     device num paths: 2
#              max eps: inf
#       device address: 18 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 880
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 2
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_bond_0:1
#           Type: network
#  System device: mlx5_bond_0 (0)
#
#      capabilities:
#            bandwidth: 2739.46/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 1016
#             am_zcopy: <= 1016, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 1K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 2
#              max eps: inf
#       device address: 18 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: knem
#     Component: knem
#             register: unlimited, cost: 18446744073709551616000000000 nsec
#           remote key: 16 bytes
#
#      Transport: knem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 13862.00/ppn + 0.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 0 bytes
#       error handling: none
#

@Clownier can you pls try the following:

  1. remove these env vars:
    setenv("UCX_RC_MAX_GET_ZCOPY", "32k", 1);
    setenv("UCX_RC_TX_NUM_GET_OPS", "8", 1);
  2. check if the server application spends a lot of time in system calls when the long-tail delays happen (for example, using the strace or top commands)

  3. Try setenv("UCX_RNDV_THRESH", "32k", 1); and setenv("UCX_RNDV_THRESH", "inf", 1); (separately)
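
For context on the UCX_RNDV_THRESH suggestion: the threshold is the message size at which a tagged send switches from the eager protocol to rendezvous. The sketch below is a simplified model under stated assumptions, not UCX's actual selection logic: it assumes messages at or above the threshold use rendezvous, that "inf" disables rendezvous entirely, and that only plain k/m/g suffixes are accepted. The names parse_size and choose_protocol are hypothetical; real UCX also takes the transport, memory type, and other thresholds into account.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef enum { PROTO_EAGER, PROTO_RNDV } proto_t;

/* Parse "32k", "4K", "1m" or "inf" into bytes.
 * Returns 0 on success, -1 on malformed input. "inf" maps to UINT64_MAX. */
static int parse_size(const char *s, uint64_t *out)
{
    char *end;
    unsigned long long v;

    assert(s != NULL && out != NULL);
    if (strcmp(s, "inf") == 0) {
        *out = UINT64_MAX;
        return 0;
    }
    v = strtoull(s, &end, 10);
    if (end == s) {
        return -1; /* no digits at all */
    }
    switch (tolower((unsigned char)*end)) {
    case 'k': v <<= 10; end++; break;
    case 'm': v <<= 20; end++; break;
    case 'g': v <<= 30; end++; break;
    case '\0': break;
    default: return -1;
    }
    if (*end != '\0') {
        return -1; /* trailing characters after the suffix */
    }
    *out = (uint64_t)v;
    return 0;
}

/* Simplified model: sizes >= threshold go rendezvous, everything else eager. */
static proto_t choose_protocol(uint64_t msg_size, uint64_t rndv_thresh)
{
    if (rndv_thresh == UINT64_MAX) {
        return PROTO_EAGER; /* "inf": rendezvous disabled */
    }
    return (msg_size >= rndv_thresh) ? PROTO_RNDV : PROTO_EAGER;
}
```

Under this model, with the threshold at 32k a 16 KB request goes eager while 32 KB and larger requests go rendezvous; with "inf" every size, including 256 KB, stays on the eager path.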

Thank you for your advice.
First, let me clarify the question a bit further. Our business request sizes typically range from 4KB to 64KB, with occasional requests reaching 256KB. During testing with different IO sizes, we observed that long tails mainly occur with requests above 32KB. However, larger IO sizes also raise the bandwidth, so we cannot tell whether the long tails are caused by the IO size itself or by the increased bandwidth.
Based on this, we conducted tests following your suggestions:

  1. After removing the UCX_RC_MAX_GET_ZCOPY and UCX_RC_TX_NUM_GET_OPS configurations, we did not observe any significant changes.

  2. We used strace to monitor the system calls on the thread where ucp_worker_progress is executed when long tails occur.

  • When there are no long tails, the strace output is as follows:
...
11:02:17.115288 futex(0x7f3e28029968, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
11:02:17.115346 futex(0x7f3e28029968, FUTEX_WAKE_PRIVATE, 1) = 0
11:02:18.098154 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2765, 891513}, ru_stime={0, 363269}, ...}) = 0
11:02:19.098167 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2766, 892186}, ru_stime={0, 363269}, ...}) = 0
11:02:19.117171 futex(0x7f3e28029968, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
11:02:19.117260 futex(0x7f3e28029968, FUTEX_WAKE_PRIVATE, 1) = 0
11:02:20.098184 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2767, 892739}, ru_stime={0, 363269}, ...}) = 0
11:02:21.098205 getrusage(0x1 /* RUSAGE_??? */, {ru_utime={2768, 893430}, ru_stime={0, 363269}, ...}) = 0
...
  • When long tails occur, the strace output is as follows:
...
11:02:17.003769 gettid()                = 12123
11:02:17.003824 clock_gettime(CLOCK_REALTIME, {1720494137, 3838477}) = 0
11:02:17.003883 futex(0x314b714, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x314b710, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
11:02:17.003943 gettid()                = 12123
11:02:17.003995 clock_gettime(CLOCK_REALTIME, {1720494137, 4009639}) = 0
11:02:17.004048 futex(0x314b714, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x314b710, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
11:02:17.004105 gettid()                = 12123
...
  • Additionally, we haven't found a way to observe system calls using top. Could you please provide a relevant reference link for this purpose?
  3. We conducted comparative tests with UCX_RNDV_THRESH set to 4K, 32K, and inf. Compared to the original configuration of 32K, setting it to 4K significantly improved bandwidth for IO sizes of 4K and 16K, but reduced stability. For IO sizes above 32K, there were no noticeable changes. With UCX_RNDV_THRESH set to inf, we observed a 30% drop in bandwidth and increased jitter, with some reduction in the long tails lasting seconds (which we attribute to the decreased bandwidth). However, the bandwidth drop is unacceptable for us.

  4. After reading "Collie: Finding Performance Anomalies in RDMA Subsystems," we found that Root Cause 6, "RDMA NIC has potential in-NIC incast/congestion," aligns with our observations. We therefore temporarily blocked, at the business level, the data transmission between the two modules on the same machine (i.e., loopback traffic) and noticed a significant reduction in long tails with no change in bandwidth. This leads us to four new questions:

  • a. Is there a way to disable RDMA loopback traffic? Or to force loopback traffic to go through a switch?

  • b. Are there any other solutions to address the increased pause and resulting long tails caused by loopback traffic?

  • c. Does UCX internally utilize IPC mechanisms such as UNIX sockets for inter-process communication, bypassing the network altogether?

  • d. After blocking loopback traffic, we still encounter a small number of long tails. What other suggestions can we try?