AMReX-Astro/Castro

Castro reactions fail with ROCm > 5.3.0

zingale opened this issue · 14 comments

Using this issue to track problems with more recent versions of ROCm

To build with ROCm 5.7.0 on Frontier:

module load cpe/23.09
module load rocm/5.7.0
module load PrgEnv-gnu craype-accel-amd-gfx90a cray-mpich
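
Since the failures in this thread depend on the ROCm version, a small guard before building or launching can help when comparing runs. The following is only a sketch: the function name `rocm_newer_than_known_good` is hypothetical, and it assumes GNU `sort -V` is available and that 5.3.0 is the last version reported to work here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Castro): report whether a ROCm version
# string is newer than 5.3.0, the last version reported to work in this issue.
rocm_newer_than_known_good() {
    local version="$1"
    local known_good="5.3.0"
    # sort -V orders version strings numerically by component. If known_good
    # sorts first and the two strings differ, the given version is newer.
    if [ "$version" != "$known_good" ] && \
       [ "$(printf '%s\n%s\n' "$known_good" "$version" | sort -V | head -n 1)" = "$known_good" ]; then
        echo "newer"
    else
        echo "not-newer"
    fi
}
```

It can be called with whatever version string your environment reports, for example the one in the loaded module name: `rocm_newer_than_known_good 5.7.0`.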

Running the 2D subchandra problem on 4 nodes / 32 GPUs:

  • We run fine if compiled with DEBUG=TRUE (although the compiling / linking takes a Hubble time)
  • We crash with problems in the reaction integration right away (step 1) if compiled normally.

More precisely, the burning fails and the step is thrown out and we need to subcycle.

The reacting_convergence problem works fine, it seems. This could be because of Strang vs. SDC, or the type of network.

Actually, I misspoke: reacting_convergence does have a burn failure in the first step, but it then seems to recover, so the run completes. I don't see this behavior when I run locally on a CPU.

Okay, I can confirm that there are also no burning failures when running reacting_convergence with ROCm 5.3.0.

So the reacting_convergence problem with inputs.64 on one node seems to be a good test problem for this.

Some summary:

  • it doesn't seem to matter whether we use SDC or Strang
  • I've seen the problem with both new-style networks and pynucastro networks

okay, with subchandra and Strang, and ROCm 5.7.0 I now get:

... Entering burner on level 0 and doing half-timestep of burning.

:0:rocdevice.cpp            :2692: 142346779052 us: [pid:90391 tid:0x7fffd6530700] Callback: Queue 0x7ffe91600000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
SIGABRT
Memory access fault by GPU node-8 (Agent handle: 0x2975b60) on address 0x800033773000. Reason: Unknown.
SIGABRT
See Backtrace.24 file for details
See Backtrace.7 file for details
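
When comparing many runs across ROCm versions, it can help to sort logs by which of the two error strings above they contain. This is a minimal sketch, assuming the job's output has been captured to a file; the function name `classify_run_log` and its output labels are hypothetical, and only the two message strings quoted in the log above are matched.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Castro): classify a captured run log by
# the GPU error strings that appear in the output quoted in this issue.
classify_run_log() {
    local log="$1"
    if grep -q 'HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION' "$log"; then
        # The HSA runtime aborted the queue with an aperture violation.
        echo "aperture-violation"
    elif grep -q 'Memory access fault by GPU' "$log"; then
        # Only the generic memory access fault message is present.
        echo "memory-access-fault"
    else
        # Neither string was found; burn failures may still have occurred.
        echo "no-gpu-memory-error"
    fi
}
```

A log that contains neither string is not necessarily a clean run, since the retried burn failures described earlier do not produce these messages.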

Is the issue we're seeing when running problems with Microphysics in Quokka (quokka-astro/quokka#394) possibly related?

It only happens on AMDGPU, and only when we use reaction networks.

We run without problem with ROCm 5.3.0 and reactions. So I am not sure if it is the same.

Hmm, that makes it probably unrelated. Have you happened to run with ROCm 5.2.3? We only have access to this version on our machine.

possibly related:
AMReX-Codes/amrex#3623

This is fixed in #2749 and is most likely a ROCm compiler bug.

yut23 commented

I just tested the failing subchandra setup with the hipcc force-inlining turned off and without #2749, and that also resolves the memory error.

Wow, ok. That's good to know!