Segmentation fault when using AMR and passive scalars with user-defined boundary functions
sunnywong314 opened this issue · 12 comments
Prerequisite checklist
Place an X in between the brackets on each line as you complete these checks:
- Did you check that the issue hasn't already been reported?
- Did you check the documentation in the Wiki for an answer?
- Are you running the latest version of Athena++?
Summary of issue
I am running my own problem generator (based on code by @c-white) in which supernova ejecta enters the box (as a user boundary condition) and interacts with a companion star. When running with AMR, the code exits with a segmentation fault (with the input provided, at cycle 332, code time 0.38). I am relatively new to Athena++, so any help/pointers are greatly appreciated.
Steps to reproduce
Configure:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg -mpi -hdf5
Compile and run:
make clean; make
mpirun -n 40 bin/athena -i inputs/model.athinput time/tlim=2.0
Input files (placed in the inputs folder; please remove the .txt extension):
donor_1p0_0p21_4p0_0.08.data.txt
model.athinput.txt
Version info
- Athena++ version: 24.0
- Compiler and version: g++ 11.4.0
- Operating system: Rocky Linux 8.10
- Hardware and cluster name (if applicable):
- External library versions (if applicable): openmpi/4.0.7 , hdf5/mpi-1.10.9
We cannot tell what is causing the problem without seeing your code. I suggest running the code under gdb (or analyzing the dumped core file with it) to identify where it died. It is a bit tricky to run gdb with MPI, but you can google it.
My apologies, I forgot to attach my problem generator
test_eos_ejecta3.cpp.txt
I will look into running gdb with MPI. Thank you for the suggestion.
I could not catch anything causing the segmentation fault, but I'm afraid your boundary conditions probably cause another problem. You are directly accessing the passive scalar array in the boundary functions, but with AMR we also need to apply the boundary conditions on what we call the coarse buffer used for AMR prolongation.
@c-white @felker do you remember the correct way to apply the boundary conditions on the scalar variables?
However, user-defined boundary conditions are currently unsupported for NSCALARS > 0, since there is no AthenaArray<Real> &r parameter in the function signature. This cannot be hacked around in the way shown in the attached pgen file.
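For reference, here is a sketch of the user-defined boundary function signature (the BValFunc typedef; verify against athena.hpp in your checkout). Hydro primitives and face-centered fields are passed in, but nothing for the passive scalars:

```c++
// Sketch of the user-defined boundary function signature (BValFunc typedef);
// check athena.hpp in your checkout for the exact form. Hydro primitives (prim)
// and face-centered fields (b) are passed in, but there is no
// AthenaArray<Real> &r argument for the passive scalars.
using BValFunc = void (*)(
    MeshBlock *pmb, Coordinates *pco, AthenaArray<Real> &prim, FaceField &b,
    Real time, Real dt,
    int il, int iu, int jl, int ju, int kl, int ku, int ngh);
```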
To elaborate, the user-defined boundary functions get called during the prolongation step of refinement in ApplyPhysicalBoundariesOnCoarseLevel(), under call stacks at sites like this one:
athena/src/bvals/bvals_refine.cpp, lines 451 to 459 in 185473d
You'll note that ph->coarse_prim_ and other refinement-specific variable buffers are what the user-defined boundary condition is being applied to there, not always ph->w. That is why hardcoding lines like the following in your boundary condition functions:
AthenaArray<Real> &prim_scalar = pmb->pscalars->r;
prim_scalar(0,k,j,i) = 0.0;
prim_scalar(1,k,j,i) = 0.0;
won't work. The function needs to be made generic enough to apply to a function parameter for, e.g., ps->coarse_r_.
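For illustration, the hydro variables already follow the generic pattern the scalars would need: the function writes only through the prim argument it is handed, which may alias either ph->w or ph->coarse_prim_ depending on the call site. A minimal sketch, assuming an inner-x1 outflow-like boundary (the function name and the copied values are placeholders, not the actual pgen):

```c++
// Minimal sketch: a user boundary function that stays generic by writing only
// through the prim argument, so it behaves correctly whether prim aliases
// ph->w or ph->coarse_prim_. Function name and values are placeholders.
void OutflowLikeInnerX1(MeshBlock *pmb, Coordinates *pco, AthenaArray<Real> &prim,
                        FaceField &b, Real time, Real dt,
                        int il, int iu, int jl, int ju, int kl, int ku, int ngh) {
  for (int k=kl; k<=ku; ++k) {
    for (int j=jl; j<=ju; ++j) {
      for (int i=1; i<=ngh; ++i) {
        prim(IDN,k,j,il-i) = prim(IDN,k,j,il);  // copy the first active cell outward
        prim(IVX,k,j,il-i) = prim(IVX,k,j,il);
        prim(IVY,k,j,il-i) = prim(IVY,k,j,il);
        prim(IVZ,k,j,il-i) = prim(IVZ,k,j,il);
        prim(IPR,k,j,il-i) = prim(IPR,k,j,il);
        // There is no scalar argument here, so nothing in this function can
        // touch the passive scalars safely -- hence the workarounds below.
      }
    }
  }
}
```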
Or you can follow @yanfeij's lead in #492 and have separate user-defined boundary functions for passive scalars like he made for the radiation intensity:
Lines 667 to 690 in 185473d
Since your user-defined boundary functions are mostly outflow, I would try hardcoding calls to the built-in outflow functions for only the passive scalars in void BoundaryValues::DispatchBoundaryFunctions, and calling your user-defined function on the other variables.
Thanks @tomidakn and @felker -- this was sort of working with an earlier version of the codebase, but perhaps that was just luck. At least there are a couple of ways forward for fixing the passive scalar boundaries. @sunnywong314 I can help pursue one of them. For this project, I'm inclined to do some quick and dirty pointer comparisons, so that only the pgen file needs to be modified, but I'll spend a little time having a closer look at the code.
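For the record, a minimal sketch of that pointer-comparison idea (this is not the actual fix; the coarse_prim_ and coarse_r_ member names follow the discussion above, and their accessibility from a pgen is an assumption to verify against hydro.hpp and scalars.hpp):

```c++
// Sketch of the "pointer comparison" idea: inside the user boundary function,
// check which primitive buffer we were handed and pick the matching scalar
// array. Member names/accessibility are assumptions; values are placeholders.
void EjectaInnerX1(MeshBlock *pmb, Coordinates *pco, AthenaArray<Real> &prim,
                   FaceField &b, Real time, Real dt,
                   int il, int iu, int jl, int ju, int kl, int ku, int ngh) {
  AthenaArray<Real> *pr = nullptr;
  if (NSCALARS > 0) {
    if (&prim == &(pmb->phydro->coarse_prim_)) {
      pr = &(pmb->pscalars->coarse_r_);  // prolongation pass on the coarse buffer
    } else {
      pr = &(pmb->pscalars->r);          // regular pass on the fine primitives
    }
  }
  for (int k=kl; k<=ku; ++k) {
    for (int j=jl; j<=ju; ++j) {
      for (int i=1; i<=ngh; ++i) {
        // ... set prim(IDN,k,j,il-i), etc., as in the original pgen ...
        if (pr != nullptr) {
          for (int n=0; n<NSCALARS; ++n)
            (*pr)(n,k,j,il-i) = 0.0;  // placeholder scalar ghost values
        }
      }
    }
  }
}
```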
@tomidakn @felker @c-white Many thanks for looking into this!
Passive scalars didn't cause the segmentation fault, but it is good to know that the hack in the boundary function doesn't work.
I removed all passive-scalar-related lines from the problem generator for clarity:
test_eos_ejecta3.cpp.txt
and scaled down the problem so that it runs faster
model.athinput.txt
I get a segmentation fault if I configure with:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg -mpi
make clean; make
and run with:
mpirun -n 20 bin/athena -i inputs/model.athinput time/tlim=2
However, if I configure without MPI:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg
and run with bin/athena -i inputs/model.athinput, then the segmentation fault goes away.
The segmentation fault also goes away if I configure with the -debug option with MPI still on:
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg -mpi -debug
and run with mpirun -n 20 bin/athena -i inputs/model.athinput time/tlim=2
I tracked down 9d763ac as the first commit that gave me the segmentation fault. All earlier commits that I tested, going back to 2bd7c69 from March 2021, were fine.
I haven't learned how to run a debugger with MPI, so I don't know which line of the code gave me the segmentation fault.
Here are the modules I have:
1) modules/2.2-20230808 (S)  2) slurm (S)  3) gcc/11.4.0  4) openmpi/4.0.7  5) hdf5/mpi-1.10.9
The modules are the same at compile time and at run time.
mpicxx --version:
g++ (Spack GCC) 11.4.0
OK, it sounds like my fault. I'll take a look.
Can you try it with nghost=2?
nghost = 2 still gives the segmentation fault (note: previous runs used xorder = 3, and for this I changed to xorder = 2).
I could reproduce your issue with g++ (8.5.0) + Intel MPI, but not with icpc (2023) + Intel MPI, so this issue seems to be g++-specific.
@sunnywong314 To try this on Popeye:
module load modules/2.3-20240529 intel-oneapi-compilers/2024.1.0 intel-oneapi-mpi/intel-2021.12.0 hdf5/intel-mpi-1.14.3
python configure.py --prob test_eos_ejecta3 --coord cartesian --flux hllc --nghost 4 --grav mg --cxx icpc -mpi -hdf5 --mpiccmd mpiicpx
In your submission script, try either srun or mpirun. Hopefully this runs smoothly, and it might even run faster.
I tested the latest code with g++ and Intel MPI but with another pgen, and it ran smoothly. So I'm afraid this issue is very subtle and specific to your pgen.