AMReX-Astro/Microphysics

ROCm memory issues affecting Microphysics codes

BenWibking opened this issue · 17 comments

Previously tracked as AMReX-Codes/amrex#3623.

Reproducer:

git clone https://github.com/AMReX-Astro/Microphysics.git
cd Microphysics/unit_test/burn_cell
export AMREX_HOME=/path/to/amrex
export AMREX_AMD_ARCH=gfx90a:xnack+
export HSA_XNACK=1
export LD_LIBRARY_PATH=/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux:$LD_LIBRARY_PATH
make USE_HIP=TRUE CXXFLAGS="-std=c++17 -m64 -fgpu-rdc --offload-arch=gfx90a:xnack+ -pthread -g -O3 -munsafe-fp-atomics -fsanitize=address -shared-libsan" LDFLAGS="-fsanitize=address -shared-libsan" -j16
./main3d.hip.HIP.ex inputs_vode_example

Error message:

==1548068==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0000020f0b48 at pc 0x7f5b9dea5ea7 bp 0x7ffe0c5b6de0 sp 0x7ffe0c5b65a0
READ of size 32 at 0x0000020f0b48 thread T0
    #0 0x7f5b9dea5ea6 in __interceptor_memcpy (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a)
    #1 0x7f5b997440a9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3440a9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #2 0x7f5b997462f6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3462f6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #3 0x7f5b997465a6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3465a6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #4 0x7f5b99712434  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x312434) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #5 0x7f5b996dcc53  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x2dcc53) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #6 0x7f5b995835e9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1835e9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #7 0x7f5b99489c0e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x89c0e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #8 0x7f5b995e650e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e650e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #9 0x7f5b99610bd9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x210bd9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #10 0x7f5b995e6f91  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e6f91) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #11 0x7f5b995f13e7 in hipLaunchKernel (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1f13e7) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #12 0xa49041 in std::enable_if<MaybeDeviceRunnable<(anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)>::value, void>::type amrex::ParallelFor<256, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(amrex::Gpu::KernelInfo const&, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:878:5
    #13 0xa49041 in void amrex::ParallelFor<int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1457:5
    #14 0xa49041 in (anonymous namespace)::ResizeRandomSeed(unsigned long) /home/bwibking/amrex/Src/Base/AMReX_Random.cpp:54:5
    #15 0xa49041 in amrex::InitRandom(unsigned long, int, unsigned long) /home/bwibking/amrex/Src/Base/AMReX_Random.cpp:104:5
    #16 0x987586 in amrex::Initialize(int&, char**&, bool, int, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/bwibking/amrex/Src/Base/AMReX.cpp:625:5
    #17 0x908243 in main /home/bwibking/Microphysics/unit_test/burn_cell/main.cpp:19:3
    #18 0x7f5b98c3feaf in __libc_start_call_main (/lib64/libc.so.6+0x3feaf) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #19 0x7f5b98c3ff5f in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3ff5f) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #20 0x8dc8c4 in _start (/home/bwibking/Microphysics/unit_test/burn_cell/main3d.hip.HIP.ex+0x8dc8c4)

0x0000020f0b48 is located 56 bytes before global variable 'helmholtz::itmax' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b80) of size 8
0x0000020f0b48 is located 24 bytes before global variable 'helmholtz::input_is_constant' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b60) of size 8
0x0000020f0b48 is located 0 bytes after global variable 'helmholtz::do_coulomb' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b40) of size 8
SUMMARY: AddressSanitizer: global-buffer-overflow (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a) in __interceptor_memcpy
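
As a sanity check on the report above, the offsets ASAN prints can be recomputed from the addresses it lists. This is a minimal sketch that only does arithmetic on numbers copied from the log; the variable names are made up for illustration, and it says nothing about whether the read is a real bug.

```shell
# Addresses and sizes copied from the AddressSanitizer report above.
read_addr=$((0x20f0b48))      # start of the 32-byte READ
do_coulomb=$((0x20f0b40))     # helmholtz::do_coulomb, size 8
input_const=$((0x20f0b60))    # helmholtz::input_is_constant, size 8
itmax=$((0x20f0b80))          # helmholtz::itmax, size 8

# Distance from the end of do_coulomb to the start of the read.
after_do_coulomb=$(( read_addr - (do_coulomb + 8) ))
# Distances from the start of the read to the next two globals.
before_input_const=$(( input_const - read_addr ))
before_itmax=$(( itmax - read_addr ))
# One past the last byte touched by the 32-byte read.
read_end=$(( read_addr + 32 ))

echo "bytes after do_coulomb:         $after_do_coulomb"
echo "bytes before input_is_constant: $before_input_const"
echo "bytes before itmax:             $before_itmax"
printf 'read covers [0x%x, 0x%x)\n' "$read_addr" "$read_end"
```

The numbers agree with the report: the read starts exactly where do_coulomb ends, and its 32 bytes run through the end of input_is_constant, so it crosses the 24-byte gap between the two globals (presumably the ASAN redzone).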

According to Weiqun, the ASAN error is a false positive.

The {Castro, Quokka} production simulations crash and/or produce error messages like this:

Memory access fault by GPU node-8 (Agent handle: 0x2975b60) on address 0x800033773000. Reason: Unknown.

(See: AMReX-Astro/Castro#2569 and quokka-astro/quokka#447.)

In all cases, the errors are not seen on host-only builds or NVIDIA GPUs.

I seem to recall seeing different behaviors when compiled in debug mode, which made me suspect it is a compiler issue.

> I seem to recall seeing different behaviors when compiled in debug mode, which made me suspect it is a compiler issue.

Ah, that's an interesting clue. @psharda you were going to try this, right? Did it ever finish building?

Here's a simple test that generates a memory issue with ROCm 5.7.0:

module load cpe/23.09
module load rocm/5.7.0
module load PrgEnv-gnu craype-accel-amd-gfx90a cray-mpich

cd Microphysics/unit_test/test_react
make NETWORK_DIR=subch_simple USE_HIP=TRUE COMP=gnu -j 4

then run on a single GPU, using the inputs_aprox13 inputs file

The output is:

Initializing AMReX (23.12-11-g064db4eaa599)...
Initializing HIP...
HIP initialized with 1 device.
AMReX (23.12-11-g064db4eaa599) initialized
reading extern runtime parameters ...
reading in network electron-capture / beta-decay tables...
Memory access fault by GPU node-4 (Agent handle: 0x1f677b0) on address 0x7fffd6ce5000. Reason: Unknown.
SIGABRT
See Backtrace.0 file for details
srun: error: frontier05193: task 0: Exited with exit code 1
srun: Terminating StepId=1533275.0

I hesitate to ask... does this compile in less than a Hubble time in debug mode?

with DEBUG=TRUE, I get:

:0:rocdevice.cpp            :2692: 719740276446 us: [pid:85303 tid:0x7fffde461700] Callback: Queue 0x7ffeaba00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

and this runs fine with ROCm 5.3.0
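
Two failure signatures have been quoted so far: the "Memory access fault by GPU" message and HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION from the debug build. When comparing runs across ROCm versions, it may help to sort saved logs by which signature they contain. A rough sketch, assuming the run output was captured to a file; the helper name `classify_run_log` is hypothetical and not part of AMReX or ROCm:

```shell
# Hypothetical helper: report which known failure signature (if any)
# appears in a captured run log. The strings are copied from this thread.
classify_run_log() {
    log_file=$1
    if grep -q 'HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION' "$log_file"; then
        echo "aperture-violation"
    elif grep -q 'Memory access fault by GPU' "$log_file"; then
        echo "memory-access-fault"
    else
        echo "no-known-signature"
    fi
}

# Example using one line copied from the test_react output above.
tmp_log=$(mktemp)
echo 'Memory access fault by GPU node-4 (Agent handle: 0x1f677b0) on address 0x7fffd6ce5000. Reason: Unknown.' > "$tmp_log"
result=$(classify_run_log "$tmp_log")
echo "$result"
rm -f "$tmp_log"
```

A log with neither string is reported as "no-known-signature", which only means these two messages were absent, not that the run was correct.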

yut23 commented

test_react appears to work fine with ROCm 5.4.0
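
With 5.3.0 and 5.4.0 reported as working and 5.7.0 as failing, one way to keep the comparison consistent is to repeat the same build steps for each version. The sketch below is a dry run: it only prints the commands from the reproducer above with the ROCm version substituted. That matching `rocm/<version>` modules exist and are compatible with `cpe/23.09` is an assumption, and `print_sweep` is a made-up name.

```shell
# Dry run: print (do not execute) the test_react build steps once per
# ROCm version. Module availability for each version is an assumption.
print_sweep() {
    for ver in "$@"; do
        echo "module load cpe/23.09"
        echo "module load rocm/$ver"
        echo "module load PrgEnv-gnu craype-accel-amd-gfx90a cray-mpich"
        echo "make realclean"
        echo "make NETWORK_DIR=subch_simple USE_HIP=TRUE COMP=gnu -j 4"
        echo "# then run on a single GPU with inputs_aprox13 and record the result"
    done
}

# Versions mentioned in this thread so far.
sweep=$(print_sweep 5.3.0 5.4.0 5.7.0)
echo "$sweep"
```

Cleaning between builds matters here because objects compiled against one ROCm release should not be linked against another.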

with rocgdb, I get:

guration: Returned hipSuccess : 
:3:hip_module.cpp           :678 : 298664095624 us: [pid:16441 tid:0x7fffed9cda80]  hipLaunchKernel ( 0x221e30, {4,1,1}, {256,1,1}, 0x7fffffff3a10, 0, stream:0x87d36a0 ) 
:3:rocvirtual.cpp           :783 : 298664095630 us: [pid:16441 tid:0x7fffed9cda80] Arg0:   = val:140648402845968
:3:rocvirtual.cpp           :2897: 298664095632 us: [pid:16441 tid:0x7fffed9cda80] ShaderName : _ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMaxEEE4evalINS_10ReduceDataIJNS_10ValLocPairIi6burn_tEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_Z9main_mainvEUliiiiE_EENSt9enable_ifIXaasr10IsFabArrayIT_EE5valuesr10IsCallableIT1_iiiiEE5valueEvE4typeERKSH_RKNS_7IntVectERT0_OSI_EUliiiE_EEvRKNS_3BoxERSH_OSQ_EUlvE_EEvimP12ihipStream_tSY_EUlvE_EEvSQ_.intern.14460905eb7cb0a1
:3:hip_module.cpp           :679 : 298664095639 us: [pid:16441 tid:0x7fffed9cda80] hipLaunchKernel: Returned hipSuccess : 
:3:hip_error.cpp            :27  : 298664095641 us: [pid:16441 tid:0x7fffed9cda80]  hipGetLastError (  ) 
:3:hip_error.cpp            :27  : 298664095644 us: [pid:16441 tid:0x7fffed9cda80]  hipGetLastError (  ) 
:3:hip_stream.cpp           :451 : 298664095648 us: [pid:16441 tid:0x7fffed9cda80]  hipStreamSynchronize ( stream:0x87d36a0 ) 
:3:rocdevice.cpp            :2651: 298664095650 us: [pid:16441 tid:0x7fffed9cda80] No HW event
:3:rocvirtual.hpp           :67  : 298664095653 us: [pid:16441 tid:0x7fffed9cda80] Host active wait for Signal = (0x7fffcbee4000) for -1 ns
Memory access fault by GPU node-4 (Agent handle: 0x42bfbc0) on address 0x7ff7e03d5000. Reason: Unknown.

Thread 2 "main3d.hip.x86-" hit Breakpoint 1, 0x00007fffe80f81de in abort () from /lib64/libc.so.6
(gdb) interrupt
(gdb) 
Thread 1 "main3d.hip.x86-" stopped.
0x00007fffdf6769f9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
bt
#0  0x00007fffdf6769f9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#1  0x00007fffdf67684a in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#2  0x00007fffdf669fa9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#3  0x00007fffe9305793 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#4  0x00007fffe92fc318 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#5  0x00007fffe92ffcbf in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#6  0x00007fffe9301a03 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#7  0x00007fffe92ff225 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#8  0x00007fffe92d330b in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#9  0x00007fffe92d3920 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#10 0x00007fffe92d39cc in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#11 0x00007fffe92d6b28 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#12 0x00007fffe9239503 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#13 0x00007fffe923992c in hipStreamSynchronize () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#14 0x0000000002f265c6 in amrex::Gpu::Device::streamSynchronize ()
    at /ccs/home/zingale/amrex/Src/Base/AMReX_GpuDevice.cpp:613
#15 0x0000000002fa45ec in amrex::Gpu::streamSynchronize ()
    at /ccs/home/zingale/amrex/Src/Base/AMReX_GpuDevice.H:241
#16 amrex::MFIter::Finalize (this=0x7fffffff3a70)
    at /ccs/home/zingale/amrex/Src/Base/AMReX_MFIter.cpp:242
#17 0x0000000002fa456c in amrex::MFIter::~MFIter (this=0x4292690)
    at /ccs/home/zingale/amrex/Src/Base/AMReX_MFIter.cpp:212
#18 0x0000000002ea2205 in amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&) (this=<optimized out>, mf=..., 
    nghost=..., reduce_data=..., f=...) at /ccs/home/zingale/amrex/Src/Base/AMReX_Reduce.H:453
#19 amrex::ParReduce<amrex::ReduceOpMax, amrex::ValLocPair<int, burn_t>, amrex::FArrayBox, main_main()::{lambda(int, int, int, int)#1}, void>(amrex::TypeList<amrex::ReduceOpMax>, amrex::TypeList<amrex::ValLocPair<int, burn_t> >, amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, main_main()::{lambda(int, int, int, int)#1}&&) (fa=..., nghost=..., operation_list=..., type_list=..., 
    f=...) at /ccs/home/zingale/amrex/Src/Base/AMReX_ParReduce.H:103
#20 amrex::ParReduce<amrex::ReduceOpMax, amrex::ValLocPair<int, burn_t>, amrex::FArrayBox, main_main()::{lambda(int, int, int, int)#1}, void>(amrex::TypeList<amrex::ReduceOpMax>, amrex::TypeList<amrex::ValLocPair<int, burn_t> >, amrex::FabArray<amrex::FArrayBox> const&, main_main()::{lambda(int, int, int, int)#1}&&) (fa=..., operation_list=..., type_list=..., f=...)
    at /ccs/home/zingale/amrex/Src/Base/AMReX_ParReduce.H:288
#21 main_main () at main.cpp:203
#22 0x0000000002ea0d41 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:26

yut23 commented

Here's a backtrace from inside a thread:

#0  0x00007ff7b0c54630 in dgesl<23> (a1=..., pivot1=..., b1=...) at ../../util/linpack.H:24
#1  dvnlsd<amrex::Array1D<short, 1, 23>, burn_t, dvode_t<23> > (pivot=..., NFLAG=<optimized out>, state=..., vstate=...) at ../../integration/VODE/vode_dvnlsd.H:117                                                                           
#2  dvstep<burn_t, dvode_t<23> > (state=..., vstate=...) at ../../integration/VODE/vode_dvstep.H:177
#3  dvode<burn_t, dvode_t<23> > (state=..., vstate=...) at ../../integration/VODE/vode_dvode.H:186
#4  actual_integrator<burn_t> (state=..., dt=<optimized out>) at ../../integration/VODE/actual_integrator.H:88
#5  integrator<burn_t> (state=..., dt=<optimized out>) at ../../integration/integrator.H:14
#6  burner<burn_t> (state=..., dt=<optimized out>) at ../../interfaces/burner.H:92
#7  do_react (i=<optimized out>, j=<optimized out>, k=<optimized out>, state=..., burn_state=..., n_rhs=..., p=...) at ./react_zones.H:49                                                                                                      
#8  main_main()::{lambda(int, int, int, int)#1}::operator()(int, int, int, int) const (this=<optimized out>, box_no=<optimized out>, i=<optimized out>, j=<optimized out>, k=<optimized out>) at main.cpp:211                                  
#9  amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}::operator()(int, int, int) const (this=<optimized out>, i=<optimized out>, j=<optimized out>, k=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:459
#10 amrex::Reduce::detail::call_f<amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1} const&, int, int, int, amrex::IndexType) (f=..., i=<optimized out>, j=<optimized out>, k=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:324
#11 amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::Box const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}::operator()() const (this=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:545
#12 amrex::launch<256, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::Box const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}>(int, unsigned long, ihipStream_t*, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}::operator()() const (this=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:779
#13 _ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMaxEEE4evalINS_10ReduceDataIJNS_10ValLocPairIi6burn_tEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_Z9main_mainvEUliiiiE_EENSt9enable_ifIXaasr10IsFabArrayIT_EE5valuesr10IsCallableIT1_iiiiEE5valueEvE4typeERKSH_RKNS_7IntVectERT0_OSI_EUliiiE_EEvRKNS_3BoxERSH_OSQ_EUlvE_EEvimP12ihipStream_tSY_EUlvE_EEvSQ_.intern.3d5caca8830a6260 () at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_GpuLaunchGlobal.H:16

Just want to confirm: is it the case that #1422 and additional PRs will be needed to fully fix this?

That's the thinking. We won't know until we do it though. Of course, ROCm could also just fix their issues...

I really want ROCm 6.0 to be available for us to test with.

@BenWibking @zingale could I meanwhile try our Quokka simulation with #1422 as the Microphysics submodule (since we have ROCm 6.0 available)? I guess we would also need to make changes in Quokka and/or Microphysics CMakeLists?

> I really want ROCm 6.0 to be available for us to test with.

With ROCm 6.0, we are still seeing the same memory error and crash as before, so something still appears to be wrong on their end.

the test_react problem with subch_simple works now with ROCm 5.7.1 with the latest version of Microphysics. So we need to find another test problem.

> the test_react problem with subch_simple works now with ROCm 5.7.1 with the latest version of Microphysics. So we need to find another test problem.

unit_test/burn_cell still reports the same false positive ASAN error with ROCm 6.0. It runs fine without ASAN, though.

The debug build is still linking...

I don't think we have any more instances of pure Microphysics tests failing with ROCm > 5.3.0.
For Castro, we worked around an issue, and Castro now runs with ROCm 6.0: AMReX-Astro/Castro#2749