parthenon-hpc-lab/parthenon

CUDA IPC issue


I was made aware of a CUDA IPC-related issue when running a simple AthenaPK test case in parallel.
Given that the test case mostly exercises Parthenon base features, I'm opening an issue here to raise awareness.

The test setup works on a single rank and on two ranks, but fails on 4 ranks during mesh refinement (sometimes not on the first refinement but on a subsequent one).
Disabling CUDA IPC via

export UCX_TLS=rc_x,self,sm,gdr_copy,cuda_copy

i.e., leaving cuda_ipc out of the transport list, also fixes the issue.
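
For a single run, the same workaround can be applied inline (a sketch; this assumes srun exports the caller's environment to the ranks, which is Slurm's default behavior):

# work around the failure by dropping cuda_ipc from the UCX transport list for this run
$ UCX_TLS=rc_x,self,sm,gdr_copy,cuda_copy srun -n 4 ../build-fresh/bin/athenaPK -i ../inputs/advection_3d.in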

It might be worth trying to reproduce this with the Parthenon advection example (see the sketch below) and on other software stacks.
Realistically, I'll only be able to take a closer look in about two weeks, so if anyone else has ideas in the meantime, go for it! I was not able to reproduce the issue on Frontier, so it might be a CUDA issue; ping @forrestglines
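
A possible starting point on the Parthenon side (a sketch only; the example binary and input file names below, example/advection/advection-example and parthinput.advection, are from memory and may differ from the actual build layout):

# untested sketch: run the Parthenon advection example on 4 ranks from a build directory
$ srun -n 4 ./example/advection/advection-example -i ../example/advection/parthinput.advection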

Steps to reproduce (using NVHPC/CUDA/OpenMPI 23.1/11.7.64/4.1.4 and 23.7/12.2.91/4.1.5 on A100s on JUWELS Booster): compile current AthenaPK main and run the advection pgen.
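
The build step boils down to roughly the following (a sketch, not the exact commands; the CMake options are the standard Kokkos CUDA flags for A100 and may need adjusting to the local toolchain):

# sketch: clone and build AthenaPK with CUDA enabled for A100 (Kokkos_ARCH_AMPERE80)
$ git clone --recursive https://github.com/parthenon-hpc-lab/athenapk.git
$ cd athenapk
$ cmake -S. -Bbuild-fresh -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE80=ON
$ cmake --build build-fresh -j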

# works
$ srun -n 1 ../build-fresh/bin/athenaPK -i ../inputs/advection_3d.in 

# works
$ srun -n 2 ../build-fresh/bin/athenaPK -i ../inputs/advection_3d.in

# FAILS after mesh refinement
$ srun -n 4 ../build-fresh/bin/athenaPK -i ../inputs/advection_3d.in
cycle=51 time=5.2174263355214599e-02 dt=1.0230267049312646e-03 zone-cycles/wsec_step=7.38e+05 wsec_total=2.18e+00 wsec_step=1.22e-01 zone-cycles/wsec=2.36e+05 wsec_AMR=2.61e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 281
Number of physical refinement levels = 2
Number of logical  refinement levels = 4
  Physical level = 0 (logical level = 2): 44 MeshBlocks, cost = 44
  Physical level = 1 (logical level = 3): 149 MeshBlocks, cost = 149
  Physical level = 2 (logical level = 4): 88 MeshBlocks, cost = 88
--------------------------------------------------------------------
[1711532903.925045] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925089] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925093] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925096] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925099] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925102] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925105] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925107] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925110] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925121] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925127] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925133] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925138] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925145] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925150] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925156] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925162] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925167] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925172] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925180] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925187] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925193] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925199] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925206] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925212] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.925218] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[jwb0033.juwels:00000] *** An error occurred in MPI_Test
[jwb0033.juwels:00000] *** reported by process [2349416960,2]
[jwb0033.juwels:00000] *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
[jwb0033.juwels:00000] *** MPI_ERR_INTERN: internal error
[jwb0033.juwels:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[jwb0033.juwels:00000] ***    and potentially your MPI job)
[jwb0033.juwels:30564] [0] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libopen-pal.so.40(opal_backtrace_buffer+0x25) [0x14fac811eb25]
[jwb0033.juwels:30564] [1] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_mpi_abort+0x8e) [0x14facb63604e]
[jwb0033.juwels:30564] [2] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_mpi_errors_are_fatal_comm_handler+0x36f) [0x14facb6243ef]
[jwb0033.juwels:30564] [3] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_errhandler_request_invoke+0x6f0) [0x14facb623f30]
[jwb0033.juwels:30564] [4] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(PMPI_Test+0xb9) [0x14facb670879]
[jwb0033.juwels:30564] [5] func:../build-fresh-24/bin/athenaPK() [0x97c43d]
[jwb0033.juwels:30564] [6] func:../build-fresh-24/bin/athenaPK() [0x4b9bca]
[jwb0033.juwels:30564] [7] func:../build-fresh-24/bin/athenaPK() [0x4b4efa]
[jwb0033.juwels:30564] [8] func:../build-fresh-24/bin/athenaPK() [0x4b489b]
[jwb0033.juwels:30564] [9] func:/usr/lib64/libpthread.so.0(+0xfe67) [0x14facae03e67]
[jwb0033.juwels:30564] [10] func:../build-fresh-24/bin/athenaPK() [0x4b576e]
[jwb0033.juwels:30564] [11] func:../build-fresh-24/bin/athenaPK() [0x4b8a13]
[jwb0033.juwels:30564] [12] func:/p/software/juwelsbooster/stages/2024/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xe0a93) [0x14fac898aa93]
[jwb0033.juwels:30564] [13] func:/usr/lib64/libpthread.so.0(+0x81ca) [0x14facadfc1ca]
[jwb0033.juwels:30564] [14] func:/usr/lib64/libc.so.6(clone+0x43) [0x14fac819ce73]
[1711532903.927606] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927658] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927669] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927676] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927682] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927689] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927694] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927700] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927706] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927711] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927717] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[1711532903.927723] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[jwb0033.juwels:00000] *** An error occurred in MPI_Test
[jwb0033.juwels:00000] *** reported by process [2349416960,1]
[jwb0033.juwels:00000] *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
[jwb0033.juwels:00000] *** MPI_ERR_INTERN: internal error
[jwb0033.juwels:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[jwb0033.juwels:00000] ***    and potentially your MPI job)
[jwb0033.juwels:30662] [0] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libopen-pal.so.40(opal_backtrace_buffer+0x25) [0x1461289e6b25]
[jwb0033.juwels:30662] [1] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_mpi_abort+0x8e) [0x14612befe04e]
[jwb0033.juwels:30662] [2] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_mpi_errors_are_fatal_comm_handler+0x36f) [0x14612beec3ef]
[jwb0033.juwels:30662] [3] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_errhandler_request_invoke+0x6f0) [0x14612beebf30]
[jwb0033.juwels:30662] [4] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(PMPI_Test+0xb9) [0x14612bf38879]
[jwb0033.juwels:30662] [5] func:../build-fresh-24/bin/athenaPK() [0x97c43d]
[jwb0033.juwels:30662] [6] func:../build-fresh-24/bin/athenaPK() [0x4b9bca]
[jwb0033.juwels:30662] [7] func:../build-fresh-24/bin/athenaPK() [0x4b4efa]
[jwb0033.juwels:30662] [8] func:../build-fresh-24/bin/athenaPK() [0x4b489b]
[jwb0033.juwels:30662] [9] func:/usr/lib64/libpthread.so.0(+0xfe67) [0x14612b6cbe67]
[jwb0033.juwels:30662] [10] func:../build-fresh-24/bin/athenaPK() [0x4b576e]
[jwb0033.juwels:30662] [11] func:../build-fresh-24/bin/athenaPK() [0x4b8a13]
[jwb0033.juwels:30662] [12] func:/p/software/juwelsbooster/stages/2024/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xe0a93) [0x146129252a93]
[jwb0033.juwels:30662] [13] func:/usr/lib64/libpthread.so.0(+0x81ca) [0x14612b6c41ca]
[jwb0033.juwels:30662] [14] func:/usr/lib64/libc.so.6(clone+0x43) [0x146128a64e73]