Cosmoglobe/Commander

Commander crashes when attempting first run


Hi all,

I am running Commander3 on a cluster. I compiled the current master branch with the Intel compilers. When I attempt to run a tutorial parameter file, the job is terminated with the error attached below.

Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1548)..............: 
MPIDI_OFI_mpi_init_hook(1554): 
(unknown)(): Other MPI error
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090959
:
system msg for write_line failure : Bad file descriptor
Abort(1090959) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(176)........: 
MPID_Init(1548)..............: 
MPIDI_OFI_mpi_init_hook(1554): 
(unknown)(): Other MPI error
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1090959
:
system msg for write_line failure : Bad file descriptor
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
libpthread-2.31.s  0000152B82E37420  Unknown               Unknown  Unknown
libmpi.so.12.0.0   0000152B81233BE1  MPIR_Err_return_c     Unknown  Unknown
libmpi.so.12.0.0   0000152B813D9ED0  MPI_Init              Unknown  Unknown
libmpifort.so.12.  0000152B829D748B  PMPI_INIT             Unknown  Unknown
commander3         000000000049276A  MAIN__                     77  commander.f90
commander3         00000000004923BD  Unknown               Unknown  Unknown
libc-2.31.so       0000152B806DD083  __libc_start_main     Unknown  Unknown
commander3         00000000004922DE  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28425,1],0]
  Exit code:    174
--------------------------------------------------------------------------
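For completeness, the run is launched roughly like this (a sketch only; the process count and parameter-file name below are placeholders, not my exact values):

  # hypothetical launch line: process count and parameter file are placeholders
  mpirun -np 64 ./commander3 param_tutorial.txt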

The parameter file is from the BP10 branch, and the MPI version I was using is mpirun (Open MPI) 4.1.5. I did not modify much, only the output path and the data path. Is this an error related to MPI, or to running out of memory on my cluster? Looking forward to any help.
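
In case it helps with the MPI question, I can also check which MPI library the binary is actually linked against (a sketch, assuming the executable is named commander3 and is in the current directory):

  # list the MPI libraries the commander3 binary resolves at runtime
  ldd ./commander3 | grep -i mpi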