Unexpected modprobe processes on RHEL9 CPU-only nodes using OpenMPI 5 with UCX built with CUDA
ZQyou commented
Describe the bug
I am not sure whether this is a UCX bug, but I would like to understand the behavior. I built OpenMPI 5 against the UCX shipped with HPC-X, whose UCX libraries are built with CUDA support. Whenever I run an MPI application with this OpenMPI on CPU-only nodes, modprobe processes start alongside the MPI executable and occupy the allocated CPUs for minutes. These modprobe processes are trying to load GPU modules. As a result, the actual job completes only after the modprobe processes have finished. This happens every time an MPI executable is launched, and only on RHEL9; our other clusters, which run RHEL7, are not affected.
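A minimal sketch of how the modprobe processes could be captured while a job is running, so that their parent PID and elapsed time can be matched against the MPI ranks. The helper names `filter_modprobe` and `list_modprobe` are hypothetical, and the sketch assumes a `ps` that accepts repeated `-o` options with empty headers (as procps on RHEL does).

```shell
#!/bin/sh
# Keep only ps-style lines (pid ppid etime args...) whose command is modprobe.
# Reads from stdin, so it works on live `ps` output or on a saved capture.
filter_modprobe() {
    awk '{ n = split($4, parts, "/"); if (parts[n] == "modprobe") print }'
}

# List running modprobe processes with parent PID and elapsed time.
# The ppid column shows which process spawned each modprobe.
list_modprobe() {
    ps -e -o pid= -o ppid= -o etime= -o args= | filter_modprobe
}
```

Running `list_modprobe` on the compute node while `mpirun` is active would show whether the modprobe processes are children of the MPI ranks or of something else on the node.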
Steps to Reproduce
- Command line
mpirun <any_mpi_executable>
- UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
$ ucx_info -v
# Library version: 1.16.0
# Library path: /apps/hpcx/2.17.1/ucx/mt/lib/libucs.so.0
# API headers version: 1.16.0
# Git branch '', revision 02432d3
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.2.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.17.1-gcc-mlnx_ofed-redhat9-cuda12-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37
- Any UCX environment variables used
$ env |grep ^UCX
UCX_NET_DEVICES=mlx5_0:1
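Since the UCX build above is configured with `--with-cuda`, it may help to check which CUDA-related transports UCX reports on a CPU-only node. The sketch below is an illustration, not a confirmed diagnosis: the helper name `cuda_transports` is hypothetical, and it assumes that `ucx_info -d` prints transport lines in the form `#      Transport: <name>`.

```shell
#!/bin/sh
# Print the unique CUDA-related transport names found in `ucx_info -d` output.
# Reads from stdin; assumes lines of the form "#      Transport: <name>".
cuda_transports() {
    awk '$2 == "Transport:" && $3 ~ /cuda/ { print $3 }' | sort -u
}

# Typical use on a compute node:
#   ucx_info -d | cuda_transports
#
# Possible experiment (not verified to stop the modprobe calls): exclude the
# CUDA transports for a single run via UCX's deny-list syntax.
#   mpirun -x UCX_TLS=^cuda <any_mpi_executable>
```

If the modprobe processes still appear with the CUDA transports excluded, the module loading is probably triggered elsewhere, for example when the CUDA-enabled libraries are loaded rather than when a transport is selected.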
Setup and versions
- OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
$ uname -a
Linux c1001.ten.osc.edu 5.14.0-284.62.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Apr 5 09:44:49 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
- For RDMA/IB/RoCE related issues:
$ ofed_info -s
MLNX_OFED_LINUX-5.8-3.0.7.0.202404261301:
$ ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.36.1010
Hardware version: 0
Node GUID: 0xa088c20300c6e2ea
System image GUID: 0xa088c20300c6e2ea
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 1750
LMC: 0
SM lid: 1861
Capability mask: 0xa651e848
Port GUID: 0xa088c20300c6e2ea
Link layer: InfiniBand
Additional information (depending on the issue)
- OpenMPI version: 5.0.2