quokka-astro/quokka

AMD GPUs produce incorrect results

Closed this issue · 51 comments

For the exact same setup, a Pop III simulation on Setonix GPUs shows a very different evolution compared to CPUs. The initial density projections from both are identical:

[Image: plt00000_Projection_z_gasDensity_gasDensity]

However, after a few hundred steps, the GPU projections look weird whereas the CPU projections look more reasonable. GPU projection:

[Image: plt00900_Projection_z_gasDensity_gasDensity]

CPU projection:

[Image: plt00900_Projection_z_gasDensity_gasDensity]

The GPU run is much faster, but it needs flux corrections in several cells at each timestep almost from t = 0, whereas the CPU run doesn't need any. At some point, the GPU run aborts because of rho = 0 or NaN during regridding.

There seems to be some issue with Setonix GPUs.

Can you try this on Gadi GPU and see what happens?

Also to test: run one time step only on both Setonix CPU and GPU. The results should be bit-wise identical. Are they, and if not, are they different in all cells or just some?
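
A minimal sketch of such a check (binary, input, and plotfile names here are placeholders, and the input file is assumed to be set to stop after one coarse step; fcompare, discussed below, gives a per-variable comparison instead of a byte-level one):

# run one coarse step with the CPU build and with the GPU build
cd cpubuild && srun ./popiii popiii.in && cd ..
cd gpubuild && srun ./popiii popiii.in && cd ..
# 'diff -qr' prints nothing if the two plotfile directories are byte-for-byte identical
diff -qr cpubuild/plt00001 gpubuild/plt00001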

@markkrumholz - the last time I tried on Gadi GPUs, they gave a memory error. Lately, runs on gpuvolta are being put on hold, or do not start for hours. Is there an express GPU queue for testing and debugging on Gadi?

There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?

After one timestep, the min/max of the ratio of GPU to CPU densities is:

In [17]: np.min(dens_gpu / dens_cpu), np.max(dens_gpu / dens_cpu)
Out[17]:
(unyt_quantity(0.00999067, '(dimensionless)'),
 unyt_quantity(100.00105867, '(dimensionless)'))

But the minimum and maximum densities in both runs are similar. So it looks like the ratio ranges from 0.01 to 100 because some core cells in one run have the background density in the other run, and vice versa.

There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?

Is there a queue name for it? Or is this the project with Yuan-Sen as the PI? I don't think I'm part of it.

fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp

You can compile it by just typing make inside extern/amrex/Tools/Plotfile.
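
For example (paths are placeholders; the executable suffix depends on the compiler and host architecture detected by the build):

# build fcompare inside the AMReX copy bundled with Quokka
cd extern/amrex/Tools/Plotfile
make
# compare two plotfiles variable by variable
./fcompare.gnu.ex /path/to/cpu_run/plt00001 /path/to/gpu_run/plt00001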

There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?

Is there a queue name for it? Or is this the project with Yuan-Sen as the PI? I don't think I'm part of it.

The latter. Does this work:

#!/bin/bash

#PBS -N quokka_run
#PBS -P dg97
#PBS -q gpursaa
#PBS -l ncpus=14
#PBS -l ngpus=1
#PBS -l mem=50GB
#PBS -m aeb
#PBS -l wd
#PBS -l walltime=1:00:00
#PBS -l storage=scratch/jh2+gdata/jh2
#PBS -l jobfs=50GB

# --- initialize Quokka ---
MPI_OPTIONS="-np $PBS_NGPUS --map-by numa:SPAN --bind-to numa --mca pml ucx"
echo "Using MPI_OPTIONS = $MPI_OPTIONS"

mpirun $MPI_OPTIONS ./popiii popii.in

Yeah, I'm not a part of dg97

fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp

You can compile it by just typing make inside extern/amrex/Tools/Plotfile.

fgradient.cpp:9:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
make: *** [../../Tools/GNUMake/Make.rules:89: fgradient.gnu.x86-milan.ex] Error 1

oh weird, I just checked my email, and I was also removed from dg97...

Maybe because someone realized we are not at RSAA anymore? ;)

fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp
You can compile it by just typing make inside extern/amrex/Tools/Plotfile.

fgradient.cpp:9:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~
compilation terminated.
make: *** [../../Tools/GNUMake/Make.rules:89: fgradient.gnu.x86-milan.ex] Error 1

Can you replace that line with:

#if __has_include(<filesystem>)
#include <filesystem>
#elif __has_include(<experimental/filesystem>)
#include <experimental/filesystem>
namespace std
{
namespace filesystem = experimental::filesystem;
}
#endif

psharda@setonix-02:/scratch/pawsey0807/psharda/quokka/cpubuild> ./fcompare.gnu.x86-milan.ex plt00001 ../gpubuild/plt00001

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
amrex::Abort::0::ERROR: grids do not match !!!
SIGABRT
See Backtrace.0 file for details

I get the same error for the 0th plt files too. Why won't the grids match?

Grid generation is not independent of the number of MPI ranks, so if the number of ranks you've used is different on the CPU vs. GPU test, you won't necessarily get the same grid layout. For the purposes of this test, you might also try turning off AMR and seeing if, when you use a completely flat grid structure, the differences between CPU and GPU persist.
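
A minimal sketch of how one might force comparable grids at runtime (the values are illustrative; AMReX accepts these parameters as command-line overrides of the input file):

# hypothetical overrides: no refinement and a fixed box decomposition, so the
# CPU and GPU runs generate the same grid layout
srun ./popiii popiii.in amr.max_level=0 amr.max_grid_size=32 amr.blocking_factor=32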

I have emailed Yuan-Sen and asked for you two to be added back to dg97.

I re-joined dg97

scratch on Setonix is down. Once it's back up, I'll be able to fcompare again

@BenWibking @markkrumholz something is definitely up with Setonix GPUs. Here are the density projections for a timestep from Gadi CPU and GPU:

  1. Gadi CPU:
     [Image: plt00750_Projection_z_gasDensity_gasDensity]

  2. Gadi GPU:
     [Image: plt00750_Projection_z_gasDensity_gasDensity]

Can you fcompare between these two?

ps3459@gadi-login-08:/scratch/jh2/ps3459/quokka/build>./fcompare.gnu.ex -a plt00750 ../pold/plt00750

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 gasDensity                         1.438428927e-33           3.088436355e-14
 x-GasMomentum                      1.932709218e-28           2.786481368e-14
 y-GasMomentum                      1.522501547e-28           2.262997499e-14
 z-GasMomentum                      2.358694107e-28           3.252484972e-14
 gasEnergy                          3.019209236e-23           1.028886349e-14
 gasInternalEnergy                  2.543580384e-23           9.592771651e-15
 scalar_0                           5.437038042e-43           3.174728929e-14
 scalar_1                           9.987110207e-40           3.176040866e-14
 scalar_2                           1.086344859e-33           3.057775805e-14
 scalar_3                           6.260684057e-47           3.216521655e-14
 scalar_4                           5.149771856e-44           3.300415905e-14
 scalar_5                           6.612155723e-38           3.109016504e-14
 scalar_6                           1.357764473e-48           3.276051438e-14
 scalar_7                           5.136719677e-51           3.265735903e-14
 scalar_8                           8.963144425e-37           3.237877791e-14
 scalar_9                           1.298535589e-52           3.312037584e-14
 scalar_10                          2.712016996e-40             3.1961632e-14
 scalar_11                          1.795210709e-83           3.185581266e-14
 scalar_12                           1.09031628e-76            3.17798777e-14
 scalar_13                          3.535887007e-34           3.209328795e-14
 temperature                        6.959453458e-09           2.321962785e-13
 velx                                1.21886842e-07           2.708626196e-13
 pressure                           1.695720256e-23           9.595083403e-15
 sound_speed                        2.512242645e-07           1.359843815e-13

Excellent, machine precision differences, as expected.

We should not do any production calculations on Setonix GPUs until this is fixed.

Do we discuss/report this to Setonix people?

Do we discuss/report this to Setonix people?

Yes, please open a help ticket and include a link to this GitHub issue.

If anyone still feels like debugging this, it might be worth checking whether this problem is fixed by disabling GPU-aware MPI by setting amrex.use_gpu_aware_mpi=0 (https://amrex-codes.github.io/amrex/docs_html/GPU.html#inputs-parameters).

If anyone still feels like debugging this, it might be worth checking whether this problem is fixed by disabling GPU-aware MPI by setting amrex.use_gpu_aware_mpi=0 (https://amrex-codes.github.io/amrex/docs_html/GPU.html#inputs-parameters).

In the .in file?

Yes, this can be set there, or on the command line (e.g. srun ./popiii input_file.in amrex.use_gpu_aware_mpi=0)

Ah, so if we do this, we shouldn't do export MPICH_GPU_SUPPORT_ENABLED=1 in the setonix-1node.submit script.
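
For reference, a minimal sketch of the corresponding lines in the submit script (assuming it launches with srun; binary and input file names are placeholders):

# disable GPU-aware MPI on the Cray MPICH side ...
export MPICH_GPU_SUPPORT_ENABLED=0
# ... and tell AMReX not to hand device pointers to MPI
srun ./popiii popiii.in amrex.use_gpu_aware_mpi=0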

I used export MPICH_GPU_SUPPORT_ENABLED=0 in addition to amrex.use_gpu_aware_mpi=0. The run on Setonix GPUs still crashed after 60 steps, with the same flux-correction and rho-is-NaN errors.

I used export MPICH_GPU_SUPPORT_ENABLED=0 in addition to amrex.use_gpu_aware_mpi=0. The run on Setonix GPUs still crashed after 60 steps, with the same flux-correction and rho-is-NaN errors.

Well, I guess that's not it, then...

We might have to wait until the AMD GPU node is set up at RSAA and test this on an independent system with an up-to-date version of the AMD software stack.

@psharda Would it be possible to shrink this problem to fit on a single Setonix GPU and test that? Then we would also have a test case ready to run on the Avatar AMDGPU node.

This would also help isolate whether it's an MPI bug or something wrong with the GPU computation itself.

I mean, I only used one Setonix node to run these. So I guess that was 8 GPUs? I could try running it on one GPU only.

Setonix doesn't like this; do you know why?

#!/bin/bash

#SBATCH -A pawsey0807-gpu
#SBATCH -J quokka_benchmark
#SBATCH -o 1node_%x-%j.out
#SBATCH -t 03:10:00
#SBATCH -p gpu-dev
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
##SBATCH --core-spec=8
#SBATCH -N 1

srun: error: nid003002: task 0: Segmentation fault (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=4670973.0

Hmm, no. Can you try the "Example 1: One process with a single GPU using shared node access" script here: https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start#SetonixGPUPartitionQuickStart-Compilingsoftware

Nope, didn't work. I feel like I'm missing something trivial. Can you give it a try?

I am able to run the HydroBlast3D problem with this script:

#!/bin/bash --login
#SBATCH --account=pawsey0807-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 nodes in this example
#SBATCH --gpus-per-node=1      #1 GPUs per node (1 "allocation packs" in total for the job)
#SBATCH --time=00:20:00

# load modules
module load craype-accel-amd-gfx90a
module load rocm/5.2.3

srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./build/src/HydroBlast3D/test_hydro3d_blast tests/blast_unigrid_256.in

All of the tests pass, except for the ChannelFlow test, which fails due to a compiler bug.
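
For reference, a sketch of running the suite while excluding that test (assuming the standard CTest setup in the build directory):

cd build
ctest --output-on-failure -E ChannelFlow   # run everything except tests matching "ChannelFlow"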

I am able to run the HydroBlast3D problem with this script:

#!/bin/bash --login
#SBATCH --account=pawsey0807-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 nodes in this example
#SBATCH --gpus-per-node=1      #1 GPUs per node (1 "allocation packs" in total for the job)
#SBATCH --time=00:20:00

# load modules
module load craype-accel-amd-gfx90a
module load rocm/5.2.3

srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./build/src/HydroBlast3D/test_hydro3d_blast tests/blast_unigrid_256.in

Tried the script. PopIII sim still fails after 67 steps.

That's unfortunate. This has all the hallmarks of a compiler/driver bug... hopefully the AMDGPU testing node will be up and running soon.

It would be useful to re-test on Setonix with export HSA_ENABLE_SDMA=0 set. There are known bugs in the GPU driver that can cause incorrect results when this is set to 1 (which is the default).
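
A minimal sketch of where this would go in the submit script (the launch line is a placeholder):

# route host<->device copies through blit kernels instead of the SDMA engines
export HSA_ENABLE_SDMA=0
srun ./popiii popiii.in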

@psharda for reference, can you post here the error you get when running on Moth?

Possibly related to: AMReX-Codes/amrex#3623

@psharda Weiqun suggested adding amrex::Gpu::streamSynchronize() immediately after the chemistry ParallelFor here:

That fixed the memory crashes with Castro on AMD GPUs when doing nuclear reactions.

The memory errors are fixed by adding -mllvm -amdgpu-function-calls=true to the compiler flags.

This works around AMDGPU compiler bugs when building very large kernels (the primordial chemistry network, the larger nuclear networks).
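
A sketch of one way to pass the flag when configuring with CMake (the variable choice is an assumption; the project's actual build options may differ):

# append the workaround flag for the AMDGPU (clang) compiler at configure time
cmake .. -DCMAKE_CXX_FLAGS="-mllvm -amdgpu-function-calls=true"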

There is still a real bug (at least, it also appears on CPU runs) where, at timestep 58, FOFC happens and then VODE fails to integrate the network (from @psharda):

Coarse STEP 58 at t = 2277615624 (2.277615624e-05%) starts ...
	>> Using global timestep on this coarse step (estimated work ratio: 1).
[Level 0 step 58] ADVANCE with time = 2277615624 dt = 228761562.4
[FOFC-1] flux correcting 1 cells on level 0
Coordinates: (-1.215046875e+18, -2.719390625e+18, 8.67890625e+17):  At cell (21,8,39) in Box ((0,0,32) (31,31,63) (0,0,0)): -7.6231090543352448e-24, -1.069268058649304e-17, 7.410320892898785e-18, 3.7609666350373797e-18, 5.3526763892812115e-10, 5.2902546443125309e-10, -2.8030653010662632e-33, -5.1467865827886286e-30, -5.8149295047562081e-24, -3.1857928751784568e-37, -2.5538870986621986e-34, -3.4809871986548917e-28, -6.7835317008872414e-39, -2.5744653822924568e-41, -4.5308774871340948e-27, -6.4171286356493599e-43, -1.3888196456542485e-30, -9.2237860360589853e-74, -5.6154994834351863e-67, -1.8032940347070311e-24
[FOFC-2] flux correcting 1 cells on level 0
Coordinates: (-1.215046875e+18, -2.719390625e+18, 8.67890625e+17):  At cell (21,8,39) in Box ((0,0,32) (31,31,63) (0,0,0)): -4.4439055516774191e-24, -9.7063852770399433e-18, 9.2891283062859141e-18, 3.8792162629719635e-18, 5.3612761960843476e-10, 5.2929401031211604e-10, -1.6338050081804726e-33, -2.9998038185290005e-30, -3.3898238659059865e-24, -1.8648566265595356e-37, -1.4894234044896535e-34, -2.0292483481208408e-28, -3.98069827049991e-39, -1.5061288633265628e-41, -2.6413500902039217e-27, -3.7645391218040064e-43, -8.0955352015117408e-31, -5.3722206263769706e-74, -3.102252662899324e-67, -1.0512335997061365e-24
DVODE: corrector convergence failed repeatedly or with abs(H) = HMIN
[ERROR] integration failed in net
istate = -6
zone = (0, 0, 0)
time = 0
dt = 114380781.1951238
temp start = 84.62999999999991
xn start = 0.007969664777026481 0.007969467582489294 8999.891176964589 4.929760939497877e-10 1.977453278654315e-07 0.269463080239167 5.254014144099686e-12 1.992932720226845e-14 3.506310230602529 3.313755276663732e-16 0.0007166261480667435 3.569520310420554e-47 2.119713584959603e-40 697.9645364035052
dens current = 1.974508903110628e-20
temp current = -nan
xn current = 0.007969664777026481 0.007969467582489294 8999.891176964589 4.929760939497877e-10 1.977453278654315e-07 0.269463080239167 5.254014144099686e-12 1.992932720226845e-14 3.506310230602529 3.313755276663732e-16 0.0007166261480667435 3.569520310420554e-47 2.119713584959603e-40 697.9645364035052
energy generated = 8676380294.558048
amrex::Abort::4::VODE integration was unsuccessful! !!!

For future reference, the underlying problem is that the AMDGPU compiler is fundamentally broken in how it spills registers to memory: https://discourse.llvm.org/t/the-current-state-of-spilling-function-calls-and-related-problems/2863

I think we just can't use AMD GPUs for chemistry until they rewrite how their compiler does register allocation.

ROCm 6.3.1 includes a fix for the above problem, and preliminary testing indicates that it may fix all of the issues we've been seeing. Still needs more testing and manual verification against CPU and NVIDIA GPU runs.

Plotfile output at timestep 300 from a test run on moth with ROCm 6.3.1:
[Image]

We should be able to verify correct results going forward (ROCm 6.3.1 and newer) by running the nightly test suite on moth and checking against the benchmark outputs generated on avatargpu: #842

At timestep 50, both the L1-norm and the L_inf-norm (not shown) relative differences are near machine precision:

bwibking@avatargpu:~/quokka/tests> fcompare.gnu.ex --norm 1 nvidia_popiii/plt00050 amd_popiii/plt00050

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 gasDensity                         5.216624743e+18           2.075855966e-18
 x-GasMomentum                      1.160539747e+25           6.564939963e-17
 y-GasMomentum                      1.375261869e+25           5.701645036e-17
 z-GasMomentum                       1.74420404e+24           8.399198593e-18
 gasEnergy                          1.710167259e+32           6.575113471e-16
 gasInternalEnergy                  1.710468476e+32           7.805209525e-16
 scalar_0                           7.207725993e+11           7.800407543e-16
 scalar_1                            1.32293485e+15           7.797570943e-16
 scalar_2                           6.716274575e+18           3.503674422e-18
 scalar_3                               337235798.9           3.210598535e-15
 scalar_4                           7.749760838e+10           9.205176911e-16
 scalar_5                           2.859665207e+15           2.492025619e-17
 scalar_6                               11296306.86           5.050177214e-15
 scalar_7                               23930.57354           2.819337835e-15
 scalar_8                           2.231752727e+16           1.494173458e-17
 scalar_9                               983.3212116           4.647164592e-15
 scalar_10                          4.400416355e+13           9.611545223e-17
 scalar_11                          5.240397471e-31           1.723784685e-17
 scalar_12                          1.108886623e-23           6.062794383e-17
 scalar_13                          5.271415788e+18           8.867485339e-18
 x-RiemannSolverVelocity                          0                         0
 y-RiemannSolverVelocity                          0                         0
 z-RiemannSolverVelocity                          0                         0
 temperature                        4.961384062e+44           6.626650941e-17
 velx                               1.499974573e+45           5.412996068e-17
 pressure                           1.176177696e+32           8.052640999e-16
 sound_speed                        4.564456461e+46           8.925311155e-17

At timestep 300, the differences between the two simulations have increased by many orders of magnitude, but are still relatively small in a physically meaningful sense:

bwibking@avatargpu:~/quokka/tests> fcompare.gnu.ex --norm 1 nvidia_popiii/plt00300 amd_popiii/plt00300

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 gasDensity                         2.874134521e+33            0.001143482489
 x-GasMomentum                      5.683171202e+38            0.003389429343
 y-GasMomentum                      6.304958605e+38            0.002737922331
 z-GasMomentum                      6.221109595e+38            0.003039493229
 gasEnergy                          6.289270748e+44            0.002425382794
 gasInternalEnergy                  5.915318003e+44            0.002674959976
 scalar_0                           1.166839697e+24            0.001328779693
 scalar_1                           2.142444053e+27            0.001328780667
 scalar_2                           2.192385658e+33            0.001143483437
 scalar_3                           6.036410028e+21             0.06436420467
 scalar_4                           1.280581391e+23            0.001575431427
 scalar_5                           1.317784541e+29            0.001147502569
 scalar_6                           1.047573674e+20             0.03863222743
 scalar_7                            6.45151327e+17              0.1105975557
 scalar_8                           1.721799522e+30            0.001142924101
 scalar_9                            8.70880443e+15             0.03489911394
 scalar_10                          4.101266192e+26            0.001130660874
 scalar_11                          2.914787135e-17            0.001244073829
 scalar_12                          6.140657043e-06               0.508252593
 scalar_13                          6.798944076e+32            0.001143482397
 x-RiemannSolverVelocity                          0                         0
 y-RiemannSolverVelocity                          0                         0
 z-RiemannSolverVelocity                          0                         0
 temperature                        1.371472385e+58            0.002002935733
 velx                               6.992331247e+59             0.02272055639
 pressure                           3.942593018e+44              0.0026749605
 sound_speed                        6.012219996e+59            0.001238184711

AMD GPU (moth):
[Image]

NVIDIA GPU (avatargpu):
[Image]