openucx/ucx

Unexpected modprobe processes on RHEL9 CPU-only nodes using OpenMPI 5 with UCX built with CUDA

Closed · 1 comment

Describe the bug

I am not sure whether this is a UCX bug, but I would like to understand it better. I built OpenMPI 5 against the UCX from HPC-X, whose UCX libraries were built with CUDA. When I run any MPI application with this OpenMPI on CPU-only nodes, modprobe processes run alongside the MPI executable and occupy the allocated CPUs for minutes. The modprobe processes are trying to load GPU modules. As a result, the actual job completes only after the modprobe processes have finished. This happens every time an MPI executable is launched, and only on RHEL9; our other clusters, which run RHEL7, are not affected.

Steps to Reproduce

  • Command line
mpirun <any_mpi_executable>
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
$ ucx_info -v
# Library version: 1.16.0
# Library path: /apps/hpcx/2.17.1/ucx/mt/lib/libucs.so.0
# API headers version: 1.16.0
# Git branch '', revision 02432d3
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.2.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.17.1-gcc-mlnx_ofed-redhat9-cuda12-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37
  • Any UCX environment variables used
$ env |grep ^UCX
UCX_NET_DEVICES=mlx5_0:1

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
$ uname -a
Linux c1001.ten.osc.edu 5.14.0-284.62.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 5 09:44:49 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
  • For RDMA/IB/RoCE related issues:
$ ofed_info -s
MLNX_OFED_LINUX-5.8-3.0.7.0.202404261301:
$ ibstat
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.36.1010
        Hardware version: 0
        Node GUID: 0xa088c20300c6e2ea
        System image GUID: 0xa088c20300c6e2ea
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 1750
                LMC: 0
                SM lid: 1861
                Capability mask: 0xa651e848
                Port GUID: 0xa088c20300c6e2ea
                Link layer: InfiniBand

Additional information (depending on the issue)

  • OpenMPI version: 5.0.2

Hi @ZQyou

This does not appear to be a UCX issue. Please check the command line of these processes, what launches them (the parent process), and why they take so long to complete.
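One possible way to collect that information is sketched below. It assumes a Linux node with the procps `ps` and `pgrep` tools; the function name `show_proc_parents` is hypothetical and is not part of UCX, OpenMPI, or HPC-X. For each process whose name matches exactly, it prints the PID, parent PID, elapsed time, and full command line, followed by the same fields for its parent.

```shell
# Hypothetical helper (not a UCX or OpenMPI tool): list matching processes
# together with their parent processes.
# Usage: show_proc_parents [process_name]   (defaults to "modprobe")
show_proc_parents() {
    name=${1:-modprobe}
    # pgrep -x matches the process name exactly
    for pid in $(pgrep -x "$name"); do
        # PID, parent PID, elapsed time, and command line of the process itself;
        # skip it if it exited between pgrep and ps
        ps -o pid=,ppid=,etime=,args= -p "$pid" || continue
        # Look up the parent PID and print the same fields for the parent
        ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        if [ -n "$ppid" ] && [ "$ppid" -gt 0 ]; then
            ps -o pid=,ppid=,etime=,args= -p "$ppid"
        fi
    done
    return 0
}
```

Running `show_proc_parents modprobe` on the compute node while the MPI job is stalled should show which module each modprobe is asked to load and which process spawned it. The processes can be short-lived, so it may need to be run several times.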

If, after that, you are convinced it is related to UCX, let us know.