AMD GPUs produce incorrect results
For the exact same setup, a Pop III simulation on Setonix GPUs shows a very different evolution compared to CPUs. The initial density projections from both runs are identical (images omitted).
However, after a few hundred steps, the GPU projections look weird whereas the CPU projections look more reasonable (GPU and CPU projection images omitted).
The GPU run is much faster, but it needs flux correction in several cells at each timestep almost from t = 0, whereas the CPU run doesn't need any flux corrections. At some point, the GPU run aborts because of rho = 0 or NaN during regridding.
There seems to be some issue with Setonix GPUs.
Can you try this on Gadi GPU and see what happens?
Also to test: run one time step only on both Setonix CPU and GPU. The result should be bit-wise identical. Are they, and, if not, are they different in all cells, or just some?
@markkrumholz - the last time I tried on Gadi GPUs, they gave a memory error. Lately, runs on gpuvolta are being put on hold or do not start for hours. Is there an express GPU queue for testing and debugging on Gadi?
There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?
After one timestep, the min/max ratio of densities of the CPU to the GPU runs is:
In [17]: np.min(dens_gpu / dens_cpu), np.max(dens_gpu / dens_cpu)
Out[17]:
(unyt_quantity(0.00999067, '(dimensionless)'),
unyt_quantity(100.00105867, '(dimensionless)'))
But the minimum and maximum densities in both runs are similar. So it looks like the ratio ranges from 0.01 to 100 because some core cells in one run have the background density in the other, and vice versa.
There is no express GPU queue. Can you run on the RSAA GPU nodes at NCI?
Is there a queue name for it? Or, is this the project with Yuan Sen as the PI? I don't think I'm a part of it.
fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp
You can compile it by just typing make inside extern/amrex/Tools/Plotfile.
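For reference, a minimal sketch of the build-and-compare workflow (the executable suffix, .gnu.x86-milan.ex here, depends on your compiler and machine, and the plotfile paths are placeholders):
# build fcompare inside the AMReX submodule shipped with the code
cd extern/amrex/Tools/Plotfile
make
# compare matching plotfiles from the CPU and GPU runs
./fcompare.gnu.x86-milan.ex /path/to/cpu_run/plt00001 /path/to/gpu_run/plt00001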
Is there a queue name for it? Or, is this the project with Yuan Sen as the PI? I don't think I'm a part of it.
The latter. Does this work:
#!/bin/bash
#PBS -N quokka_run
#PBS -P dg97
#PBS -q gpursaa
#PBS -l ncpus=14
#PBS -l ngpus=1
#PBS -l mem=50GB
#PBS -m aeb
#PBS -l wd
#PBS -l walltime=1:00:00
#PBS -l storage=scratch/jh2+gdata/jh2
#PBS -l jobfs=50GB
# --- initialize Quokka ---
MPI_OPTIONS="-np $PBS_NGPUS --map-by numa:SPAN --bind-to numa --mca pml ucx"
echo "Using MPI_OPTIONS = $MPI_OPTIONS"
mpirun $MPI_OPTIONS ./popiii popii.in
Yeah, I'm not a part of dg97
fcompare might be a better tool to do a direct comparison between two plotfiles: https://github.com/AMReX-Codes/amrex/blob/development/Tools/Plotfile/fcompare.cpp You can compile it by just typing make inside extern/amrex/Tools/Plotfile.
fgradient.cpp:9:10: fatal error: filesystem: No such file or directory
#include <filesystem>
^~~~~~~~~~~~
compilation terminated.
make: *** [../../Tools/GNUMake/Make.rules:89: fgradient.gnu.x86-milan.ex] Error 1
oh weird, I just checked my email, and I was also removed from dg97...
Maybe because someone realized we are not at RSAA anymore? ;)
fgradient.cpp:9:10: fatal error: filesystem: No such file or directory
#include <filesystem>
compilation terminated.
make: *** [../../Tools/GNUMake/Make.rules:89: fgradient.gnu.x86-milan.ex] Error 1
Can you replace that line with:
#if __has_include(<filesystem>)
#include <filesystem>
#elif __has_include(<experimental/filesystem>)
#include <experimental/filesystem>
namespace std
{
namespace filesystem = experimental::filesystem;
}
#endif
psharda@setonix-02:/scratch/pawsey0807/psharda/quokka/cpubuild> ./fcompare.gnu.x86-milan.ex plt00001 ../gpubuild/plt00001
variable name absolute error relative error
(||A - B||) (||A - B||/||A||)
----------------------------------------------------------------------------
amrex::Abort::0::ERROR: grids do not match !!!
SIGABRT
See Backtrace.0 file for details
I get the same error for the 0th plt files too. Why won't the grids match?
Grid generation is not independent of the number of MPI ranks, so if the number of ranks you've used differs between the CPU and GPU tests, you won't necessarily get the same grid layout. For the purposes of this test, you might also try turning off AMR and seeing if, when you use a completely flat grid structure, the differences between CPU and GPU persist.
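A sketch of what that flat-grid test could look like, assuming the standard AMReX runtime parameter amr.max_level can be overridden on the command line like the other amrex.* parameters (executable, input file, rank counts, and paths are taken from elsewhere in this thread and are placeholders):
# GPU build: disable refinement so only the base grid is generated
srun ./popiii popii.in amr.max_level=0
# CPU build: same input, refinement disabled (rank count arbitrary)
mpirun -np 14 ./popiii popii.in amr.max_level=0
# compare the resulting single-level plotfiles
./fcompare.gnu.x86-milan.ex cpu_run/plt00001 gpu_run/plt00001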
I have emailed Yuan-Sen and asked for you two to be added back to dg97.
You can allow different grids by passing fcompare the -a flag.
I re-joined dg97
scratch on Setonix is down. Once it's back up, I'll be able to fcompare again.
@BenWibking @markkrumholz something is definitely up with Setonix GPUs. Here are the density projections for a timestep from Gadi CPU and GPU (images omitted).
Can you fcompare between these two?
ps3459@gadi-login-08:/scratch/jh2/ps3459/quokka/build>./fcompare.gnu.ex -a plt00750 ../pold/plt00750
variable name absolute error relative error
(||A - B||) (||A - B||/||A||)
----------------------------------------------------------------------------
level = 0
gasDensity 1.438428927e-33 3.088436355e-14
x-GasMomentum 1.932709218e-28 2.786481368e-14
y-GasMomentum 1.522501547e-28 2.262997499e-14
z-GasMomentum 2.358694107e-28 3.252484972e-14
gasEnergy 3.019209236e-23 1.028886349e-14
gasInternalEnergy 2.543580384e-23 9.592771651e-15
scalar_0 5.437038042e-43 3.174728929e-14
scalar_1 9.987110207e-40 3.176040866e-14
scalar_2 1.086344859e-33 3.057775805e-14
scalar_3 6.260684057e-47 3.216521655e-14
scalar_4 5.149771856e-44 3.300415905e-14
scalar_5 6.612155723e-38 3.109016504e-14
scalar_6 1.357764473e-48 3.276051438e-14
scalar_7 5.136719677e-51 3.265735903e-14
scalar_8 8.963144425e-37 3.237877791e-14
scalar_9 1.298535589e-52 3.312037584e-14
scalar_10 2.712016996e-40 3.1961632e-14
scalar_11 1.795210709e-83 3.185581266e-14
scalar_12 1.09031628e-76 3.17798777e-14
scalar_13 3.535887007e-34 3.209328795e-14
temperature 6.959453458e-09 2.321962785e-13
velx 1.21886842e-07 2.708626196e-13
pressure 1.695720256e-23 9.595083403e-15
sound_speed 2.512242645e-07 1.359843815e-13
Excellent, machine precision differences, as expected.
We should not do any production calculations on Setonix GPUs until this is fixed.
Do we discuss/report this to Setonix people?
Yes, please open a help ticket and include a link to this GitHub issue.
If anyone still feels like debugging this, it might be worth checking whether this problem is fixed by disabling GPU-aware MPI by setting amrex.use_gpu_aware_mpi=0 (https://amrex-codes.github.io/amrex/docs_html/GPU.html#inputs-parameters).
In the .in file?
Yes, this can be set there, or on the command line (e.g. srun ./popiii input_file.in amrex.use_gpu_aware_mpi=0).
Ah, so if we do this, we shouldn't do export MPICH_GPU_SUPPORT_ENABLED=1 in the setonix-1node.submit script.
I used export MPICH_GPU_SUPPORT_ENABLED=0 in addition to amrex.use_gpu_aware_mpi=0. The run on Setonix GPUs still crashed after 60 steps, with the same flux-correction and rho-is-NaN errors.
Well, I guess that's not it, then...
We might have to wait until the AMD GPU node is set up at RSAA and test this on an independent system with an up-to-date version of the AMD software stack.
@psharda Would it be possible to shrink this problem to fit on a single Setonix GPU and test that? Then we would also have a test case ready to run on the Avatar AMDGPU node.
This would also help isolate whether it's an MPI bug or something wrong with the GPU computation itself.
I mean, I only used 1 setonix node to run these. So I guess that was 8 GPUs? I could try and run it on 1 GPU only.
Setonix doesn't like this, do you know why?
#!/bin/bash
#SBATCH -A pawsey0807-gpu
#SBATCH -J quokka_benchmark
#SBATCH -o 1node_%x-%j.out
#SBATCH -t 03:10:00
#SBATCH -p gpu-dev
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
##SBATCH --core-spec=8
#SBATCH -N 1
srun: error: nid003002: task 0: Segmentation fault (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=4670973.0
Hmm, no. Can you try the "Example 1: One process with a single GPU using shared node access" script here: https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start#SetonixGPUPartitionQuickStart-Compilingsoftware
Nope, didn't work. I feel like I'm missing something trivial. Can you give it a try?
I am able to run the HydroBlast3D problem with this script:
#!/bin/bash --login
#SBATCH --account=pawsey0807-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 nodes in this example
#SBATCH --gpus-per-node=1 #1 GPUs per node (1 "allocation packs" in total for the job)
#SBATCH --time=00:20:00
# load modules
module load craype-accel-amd-gfx90a
module load rocm/5.2.3
srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./build/src/HydroBlast3D/test_hydro3d_blast tests/blast_unigrid_256.in
All of the tests pass, except for the ChannelFlow test, which fails due to a compiler bug.
Tried the script. PopIII sim still fails after 67 steps.
That's unfortunate. This has all the hallmarks of a compiler/driver bug... hopefully the AMDGPU testing node will be up and running soon.
It would be useful to re-test on Setonix with export HSA_ENABLE_SDMA=0 set. There are known bugs in the GPU driver that can cause incorrect results when this is set to 1 (which is the default).
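For example, in the Setonix submit script the variable would go right before the srun line. A sketch based on the working single-GPU script above (the executable and input names are the ones used earlier in this thread):
module load craype-accel-amd-gfx90a
module load rocm/5.2.3
# disable the SDMA engines for device<->host copies (HSA_ENABLE_SDMA=1 is the default)
export HSA_ENABLE_SDMA=0
srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./popiii popii.in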
@psharda for reference, can you post here the error you get when running on Moth?
Possibly related to: AMReX-Codes/amrex#3623
The memory errors are fixed by adding -mllvm -amdgpu-function-calls=true to the compiler flags. This works around AMDGPU compiler bugs when building very large kernels (the primordial chemistry network, the larger nuclear networks).
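A sketch of one way to pass that flag, assuming a CMake-configured build in which hipcc (or amdclang++) is the C++ compiler so that -mllvm options reach the AMDGPU backend; the exact configure invocation for this project may differ:
# hypothetical configure step with the workaround flag appended
cmake .. -DCMAKE_CXX_FLAGS="-mllvm -amdgpu-function-calls=true"
cmake --build . -j 16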
There is still a real bug (at least, it also appears in CPU runs) where, at timestep 58, FOFC happens and then VODE fails to integrate the network (from @psharda):
Coarse STEP 58 at t = 2277615624 (2.277615624e-05%) starts ...
>> Using global timestep on this coarse step (estimated work ratio: 1).
[Level 0 step 58] ADVANCE with time = 2277615624 dt = 228761562.4
[FOFC-1] flux correcting 1 cells on level 0
Coordinates: (-1.215046875e+18, -2.719390625e+18, 8.67890625e+17): At cell (21,8,39) in Box ((0,0,32) (31,31,63) (0,0,0)): -7.6231090543352448e-24, -1.069268058649304e-17, 7.410320892898785e-18, 3.7609666350373797e-18, 5.3526763892812115e-10, 5.2902546443125309e-10, -2.8030653010662632e-33, -5.1467865827886286e-30, -5.8149295047562081e-24, -3.1857928751784568e-37, -2.5538870986621986e-34, -3.4809871986548917e-28, -6.7835317008872414e-39, -2.5744653822924568e-41, -4.5308774871340948e-27, -6.4171286356493599e-43, -1.3888196456542485e-30, -9.2237860360589853e-74, -5.6154994834351863e-67, -1.8032940347070311e-24
[FOFC-2] flux correcting 1 cells on level 0
Coordinates: (-1.215046875e+18, -2.719390625e+18, 8.67890625e+17): At cell (21,8,39) in Box ((0,0,32) (31,31,63) (0,0,0)): -4.4439055516774191e-24, -9.7063852770399433e-18, 9.2891283062859141e-18, 3.8792162629719635e-18, 5.3612761960843476e-10, 5.2929401031211604e-10, -1.6338050081804726e-33, -2.9998038185290005e-30, -3.3898238659059865e-24, -1.8648566265595356e-37, -1.4894234044896535e-34, -2.0292483481208408e-28, -3.98069827049991e-39, -1.5061288633265628e-41, -2.6413500902039217e-27, -3.7645391218040064e-43, -8.0955352015117408e-31, -5.3722206263769706e-74, -3.102252662899324e-67, -1.0512335997061365e-24
DVODE: corrector convergence failed repeatedly or with abs(H) = HMIN
[ERROR] integration failed in net
istate = -6
zone = (0, 0, 0)
time = 0
dt = 114380781.1951238
temp start = 84.62999999999991
xn start = 0.007969664777026481 0.007969467582489294 8999.891176964589 4.929760939497877e-10 1.977453278654315e-07 0.269463080239167 5.254014144099686e-12 1.992932720226845e-14 3.506310230602529 3.313755276663732e-16 0.0007166261480667435 3.569520310420554e-47 2.119713584959603e-40 697.9645364035052
dens current = 1.974508903110628e-20
temp current = -nan
xn current = 0.007969664777026481 0.007969467582489294 8999.891176964589 4.929760939497877e-10 1.977453278654315e-07 0.269463080239167 5.254014144099686e-12 1.992932720226845e-14 3.506310230602529 3.313755276663732e-16 0.0007166261480667435 3.569520310420554e-47 2.119713584959603e-40 697.9645364035052
energy generated = 8676380294.558048
amrex::Abort::4::VODE integration was unsuccessful! !!!
For future reference, the underlying problem is that the AMDGPU compiler is fundamentally broken in how it spills registers to memory: https://discourse.llvm.org/t/the-current-state-of-spilling-function-calls-and-related-problems/2863
I think we just can't use AMD GPUs for chemistry until they rewrite how their compiler does register allocation.
ROCm 6.3.1 includes a fix for the above problem, and preliminary testing indicates that it may fix all of the issues we've been seeing. Still needs more testing and manual verification against CPU and NVIDIA GPU runs.
We should be able to verify correct results going forward (ROCm 6.3.1 and newer) by running the nightly test suite on moth and checking against the benchmark outputs generated on avatargpu: #842
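A sketch of what that nightly check could look like, reusing fcompare with the --norm 1 option shown below; the directory names are placeholders for the moth outputs and the avatargpu-generated benchmarks:
# loop over nightly plotfiles and compare each against the stored benchmark
for plt in nightly_popiii/plt*; do
  echo "=== $(basename "$plt") ==="
  ./fcompare.gnu.ex --norm 1 "benchmarks/$(basename "$plt")" "$plt"
done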
At timestep 50, both the L1 norm and the L_inf norm (not shown) differences are near machine precision:
bwibking@avatargpu:~/quokka/tests> fcompare.gnu.ex --norm 1 nvidia_popiii/plt00050 amd_popiii/plt00050
variable name absolute error relative error
(||A - B||) (||A - B||/||A||)
----------------------------------------------------------------------------
level = 0
gasDensity 5.216624743e+18 2.075855966e-18
x-GasMomentum 1.160539747e+25 6.564939963e-17
y-GasMomentum 1.375261869e+25 5.701645036e-17
z-GasMomentum 1.74420404e+24 8.399198593e-18
gasEnergy 1.710167259e+32 6.575113471e-16
gasInternalEnergy 1.710468476e+32 7.805209525e-16
scalar_0 7.207725993e+11 7.800407543e-16
scalar_1 1.32293485e+15 7.797570943e-16
scalar_2 6.716274575e+18 3.503674422e-18
scalar_3 337235798.9 3.210598535e-15
scalar_4 7.749760838e+10 9.205176911e-16
scalar_5 2.859665207e+15 2.492025619e-17
scalar_6 11296306.86 5.050177214e-15
scalar_7 23930.57354 2.819337835e-15
scalar_8 2.231752727e+16 1.494173458e-17
scalar_9 983.3212116 4.647164592e-15
scalar_10 4.400416355e+13 9.611545223e-17
scalar_11 5.240397471e-31 1.723784685e-17
scalar_12 1.108886623e-23 6.062794383e-17
scalar_13 5.271415788e+18 8.867485339e-18
x-RiemannSolverVelocity 0 0
y-RiemannSolverVelocity 0 0
z-RiemannSolverVelocity 0 0
temperature 4.961384062e+44 6.626650941e-17
velx 1.499974573e+45 5.412996068e-17
pressure 1.176177696e+32 8.052640999e-16
sound_speed 4.564456461e+46 8.925311155e-17
At timestep 300, the differences between the two simulations have grown by many orders of magnitude, but are still relatively small in a physically meaningful sense:
bwibking@avatargpu:~/quokka/tests> fcompare.gnu.ex --norm 1 nvidia_popiii/plt00300 amd_popiii/plt00300
variable name absolute error relative error
(||A - B||) (||A - B||/||A||)
----------------------------------------------------------------------------
level = 0
gasDensity 2.874134521e+33 0.001143482489
x-GasMomentum 5.683171202e+38 0.003389429343
y-GasMomentum 6.304958605e+38 0.002737922331
z-GasMomentum 6.221109595e+38 0.003039493229
gasEnergy 6.289270748e+44 0.002425382794
gasInternalEnergy 5.915318003e+44 0.002674959976
scalar_0 1.166839697e+24 0.001328779693
scalar_1 2.142444053e+27 0.001328780667
scalar_2 2.192385658e+33 0.001143483437
scalar_3 6.036410028e+21 0.06436420467
scalar_4 1.280581391e+23 0.001575431427
scalar_5 1.317784541e+29 0.001147502569
scalar_6 1.047573674e+20 0.03863222743
scalar_7 6.45151327e+17 0.1105975557
scalar_8 1.721799522e+30 0.001142924101
scalar_9 8.70880443e+15 0.03489911394
scalar_10 4.101266192e+26 0.001130660874
scalar_11 2.914787135e-17 0.001244073829
scalar_12 6.140657043e-06 0.508252593
scalar_13 6.798944076e+32 0.001143482397
x-RiemannSolverVelocity 0 0
y-RiemannSolverVelocity 0 0
z-RiemannSolverVelocity 0 0
temperature 1.371472385e+58 0.002002935733
velx 6.992331247e+59 0.02272055639
pressure 3.942593018e+44 0.0026749605
sound_speed 6.012219996e+59 0.001238184711