lattice/quda

infinite hang when autotuning is disabled

With QUDA_ENABLE_TUNING set to 0, the program hangs without any error after the output file prints "MAKING PATH TABLES". When I enable tuning and set the cache file location with QUDA_RESOURCE_PATH, it instead hangs, again without error, after printing "cublasCreated successfully". While it is hanging, the program is actively using CPUs but not utilizing any GPU compute. Is anyone else seeing this issue?
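For reference, the two environment settings described above can be sketched as a small helper. This is a minimal sketch: the function name `setup_quda_tuning` and the default directory name `quda_tunecache` are hypothetical, and it assumes the cache location should be an existing, writable directory.

```shell
#!/usr/bin/env bash
# Sketch: enable QUDA autotuning and point the tune cache at a writable directory.
# setup_quda_tuning and the default directory name are hypothetical helpers,
# not part of QUDA or MILC.
setup_quda_tuning() {
    local cache_dir="${1:-$PWD/quda_tunecache}"

    # Create the directory up front so a missing path is ruled out as a cause.
    mkdir -p "$cache_dir" || return 1
    if [ ! -w "$cache_dir" ]; then
        echo "tune cache directory is not writable: $cache_dir" >&2
        return 1
    fi

    export QUDA_ENABLE_TUNING=1
    export QUDA_RESOURCE_PATH="$cache_dir"
}
```

Calling `setup_quda_tuning /path/to/cache` before the launch line sets both variables in the current shell. With Open MPI's mpirun, variables can be forwarded to remote ranks with `-x`, as the run script in this thread does for LD_LIBRARY_PATH.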

We haven't seen this issue. Hopefully this is easy to fix.

  • Can you tell us how to reproduce this issue?
  • What GPU and compilers are you using?
  • Can you share your cmake configuration command?

Thanks for the quick reply. I'm using two NVIDIA A100X GPUs with the compilers and libraries from the NVIDIA HPC SDK 24.1, gcc 11.4.0, on Ubuntu 22.04.4.

I'm running the MILC spectrum code from the NERSC10 Lattice QCD benchmark. My run script is:

export QUDA_ENABLE_TUNING=0
mpirun --mca btl_tcp_if_include ibs2 -np 2 -host ${HOST1},${HOST2} -x LD_LIBRARY_PATH ./ks_spectrum_hisq ./input_4864

cmake command:

cmake \
  -G "Unix Makefiles" \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_CXX_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpiCC \
  -DCMAKE_C_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpicc \
  -DCMAKE_Fortran_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpifort \
  -DCMAKE_CUDA_COMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/24.1/compilers/bin/nvcc \
  -DQUDA_GPU_ARCH=sm_80 \
  -DQUDA_DIRAC_DEFAULT_OFF=ON \
  -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_FORCE_HISQ=ON \
  -DQUDA_FORCE_GAUGE=ON \
  -DQUDA_MPI=ON \
  -DCMAKE_INSTALL_PREFIX=${QUDA_INSTALL_PREFIX} \
  ../

Thanks for the info @amwe210. I think the next thing to do is to work out where it's hanging and confirm whether the hang is in QUDA or MILC. On a hanging job, can you attach gdb to it and get the backtrace?

What network are you running on?

Is this a regression versus a prior known good version of QUDA?
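The gdb request above can be sketched as follows. This is a minimal sketch rather than a prescribed procedure: the helper name `dump_backtraces` and the `GDB` override variable are hypothetical, while `-p`, `-batch`, `-ex` and `thread apply all bt` are standard gdb options and commands. Attaching to a running process may need elevated privileges, depending on the system's ptrace settings.

```shell
#!/usr/bin/env bash
# Sketch: print a backtrace of every thread for each PID given on the command line.
# dump_backtraces and the GDB override variable are hypothetical conveniences.
dump_backtraces() {
    local gdb_bin="${GDB:-gdb}" pid

    for pid in "$@"; do
        echo "=== backtrace for PID $pid ==="
        # -batch makes gdb exit after running the -ex command, which detaches
        # from the process and lets it keep running.
        "$gdb_bin" -p "$pid" -batch -ex "thread apply all bt"
    done
}
```

On each host running a hung rank, something like `dump_backtraces $(pgrep -f ks_spectrum_hisq)` would collect the traces. The top frames should show whether the ranks are waiting inside QUDA, MILC, or the MPI library.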

@maddyscientist I was able to find the problem, and it was unrelated to the QUDA install. There was an issue with conflicting MPI libraries on my system. Thank you for your assistance. I will close this issue, since QUDA is working as expected.
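The thread does not say how the MPI conflict was found. One generic way to check for this kind of mismatch is to compare the launcher on PATH with the MPI libraries the executable resolves at run time. The sketch below assumes a dynamically linked binary on Linux, and the helper name `report_mpi` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: show which mpirun is first on PATH and which MPI shared libraries
# a dynamically linked executable resolves to. report_mpi is a hypothetical helper.
report_mpi() {
    local exe="$1"

    echo "mpirun on PATH: $(command -v mpirun || echo 'not found')"
    echo "MPI libraries resolved for $exe:"
    # Match library names such as libmpi.so or libopen-pal.so rather than any
    # path containing "mpi" (for example ".../compilers/...").
    ldd "$exe" | grep -E 'lib(mpi|open-)' || echo "  (none found)"
}
```

For example, `report_mpi ./ks_spectrum_hisq` on each host. If the mpirun path and the resolved libraries come from different MPI installations, that mismatch is a plausible source of a silent hang at startup.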