MPAS (atmosphere) breaking at mpas_init
Hello,
I'm having trouble with MPAS breaking at the mpas_init subroutine. Oddly, the issue appears only on some systems, and the error message is difficult to interpret:
$ mpirun -np 24 ./atmos.exe
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1734831948.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I contacted the Open MPI team, but it is hard to pinpoint the origin of that error code. After many tests trying to isolate the cause of the problem, I have ended up with two systems running CentOS 8 (4.18.0-240.10.1.el8_3.x86_64) and Open MPI v4.1.0. The only difference between them is that the first runs on an Intel Xeon Platinum 8259CL (2.50GHz) and the second on an Intel Xeon Platinum 8252C (3.80GHz). I cannot put my finger on why upgrading the processor would cause the problem, but I have spent the last week looking at everything else without any progress.
Thanks.
P.S. NetCDF & PnetCDF were built from the master branch (also tried 4.7.4); PIO was 2.5.2.
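One quick check on an opaque error code like the one above is to reinterpret the integer as raw bytes and see whether it is really text. The sketch below is only an illustration: `errorcode_as_text` is a hypothetical helper, not part of MPAS or Open MPI, and it assumes a 32-bit little-endian integer, as on x86_64.

```python
import struct

def errorcode_as_text(code: int) -> str:
    """Reinterpret a 32-bit integer as four raw bytes and decode them as ASCII.

    Assumes little-endian byte order (x86_64). Non-ASCII bytes are replaced.
    """
    # Mask to 32 bits so negative codes are handled as their two's-complement form.
    raw = struct.pack("<I", code & 0xFFFFFFFF)
    return raw.decode("ascii", errors="replace")

# The error code reported by MPI_ABORT in the output above.
print(errorcode_as_text(1734831948))  # prints "Logg"
```

The bytes decode to the characters `Logg`, which may hint that the value handed to MPI_Abort was a fragment of a string rather than a deliberate status code. That is an inference from the number alone; it does not by itself identify the cause.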
Does compiling with DEBUG=true provide any useful stack trace or other information? That the code is apparently calling MPI_Abort rather than simply, e.g., segfaulting suggests the error may be something detectable by the code. Are there any other error messages in the stdout/stderr from the failed jobs, or in the log.atmosphere.XXXX.err files?
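To make the log check above concrete, here is a minimal sketch that gathers every per-rank error log from a run directory. It assumes the logs sit in the directory where the model was launched and match the `log.atmosphere.XXXX.err` naming mentioned above; `collect_error_logs` is a hypothetical helper name.

```python
from pathlib import Path

def collect_error_logs(run_dir=".", pattern="log.atmosphere.*.err"):
    """Return a dict mapping each matching log file name to its text content."""
    logs = {}
    for path in sorted(Path(run_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has stray bytes.
        logs[path.name] = path.read_text(errors="replace")
    return logs

if __name__ == "__main__":
    # Print each error log under a header so messages from all ranks are visible.
    for name, text in collect_error_logs().items():
        print(f"==> {name} <==")
        print(text.rstrip())
```

Run from the run directory after a failed job, this prints whatever each rank wrote before the abort. If no files match, it prints nothing.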
The error code was caused by a missing file (a far simpler problem than I originally thought). My apologies, I completely forgot that this thread was open. Closing it.