Signal 11 Seg Fault at end of run
Hello, I am trying to run tests with OpenMPI v4.0.0 and was having issues with the IMB v2019.1 release; the OpenMPI devs told me to use commit 841446d as a workaround. That works fine until the very end of the run, where what I am guessing is a cleanup step segfaults on one or two machines. Is there any way to get more output for the end of the run? I tried using '-v' but got nothing more out of it.
Command used:
mpirun -v --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_allow_ib 1 -np 8 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
Output:
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 6
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.11 0.12 0.11
1 1000 1.72 7.06 4.86
2 1000 1.72 6.85 4.80
4 1000 1.71 6.92 4.78
8 1000 1.76 7.12 4.91
16 1000 1.76 7.18 4.89
32 1000 1.74 7.17 4.87
64 1000 1.81 7.58 5.13
128 1000 1.80 9.27 6.16
256 1000 1.84 9.54 6.34
512 1000 2.15 10.70 7.22
1024 1000 2.35 11.70 7.92
2048 1000 2.21 15.09 10.10
4096 1000 3.62 17.32 12.54
8192 1000 6.17 23.32 17.99
16384 1000 11.24 37.28 28.67
32768 1000 62.61 80.91 71.06
65536 640 109.31 131.24 120.22
131072 320 225.50 236.59 231.80
262144 160 430.89 449.17 442.21
524288 80 406.54 453.22 430.84
1048576 40 811.17 878.36 842.89
2097152 20 1788.67 1886.04 1824.92
4194304 10 2899.46 3183.22 3073.55
#---------------------------------------------------
# Benchmarking Barrier
# #processes = 2
# ( 4 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#repetitions t_min[usec] t_max[usec] t_avg[usec]
1000 2.30 2.30 2.30
#---------------------------------------------------
# Benchmarking Barrier
# #processes = 4
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#repetitions t_min[usec] t_max[usec] t_avg[usec]
1000 4.87 4.87 4.87
#---------------------------------------------------
# Benchmarking Barrier
# #processes = 6
#---------------------------------------------------
#repetitions t_min[usec] t_max[usec] t_avg[usec]
1000 8.54 8.54 8.54
# All processes entering MPI_Finalize
[titan:08194] *** Process received signal ***
[titan:08194] Signal: Segmentation fault (11)
[titan:08194] Signal code: Address not mapped (1)
[titan:08194] Failing at address: 0x10
[titan:08194] [ 0] /lib64/libpthread.so.0(+0xf680)[0x7f0218104680]
[titan:08194] [ 1] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x2a865)[0x7f021777f865]
[titan:08194] [ 2] /opt/openmpi/4.0.0/lib/openmpi/mca_rcache_grdma.so(+0x1fd9)[0x7f020b9defd9]
[titan:08194] [ 3] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_rcache_base_module_destroy+0x8f)[0x7f021781d55f]
[titan:08194] [ 4] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(+0xeba7)[0x7f020ac73ba7]
[titan:08194] [ 5] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x601)[0x7f020ac6ef91]
[titan:08194] [ 6] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x76213)[0x7f02177cb213]
[titan:08194] [ 7] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f02177b5799]
[titan:08194] [ 8] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f02177b5799]
[titan:08194] [ 9] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_mpi_finalize+0x86f)[0x7f0218367c1f]
[titan:08194] [10] IMB-MPI1[0x4025d4]
[titan:08194] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0217d473d5]
[titan:08194] [12] IMB-MPI1[0x401d59]
[titan:08194] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[pandora:13903] *** Process received signal ***
[pandora:13903] Signal: Segmentation fault (11)
[pandora:13903] Signal code: Address not mapped (1)
[pandora:13903] Failing at address: 0x10
[pandora:13903] [ 0] /lib64/libpthread.so.0(+0xf680)[0x7f68ee599680]
[pandora:13903] [ 1] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x2a865)[0x7f68edc14865]
[pandora:13903] [ 2] /opt/openmpi/4.0.0/lib/openmpi/mca_rcache_grdma.so(+0x1fd9)[0x7f68e1b8bfd9]
[pandora:13903] [ 3] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_rcache_base_module_destroy+0x8f)[0x7f68edcb255f]
[pandora:13903] [ 4] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(+0xeba7)[0x7f68e1548ba7]
[pandora:13903] [ 5] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x601)[0x7f68e1543f91]
[pandora:13903] [ 6] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x76213)[0x7f68edc60213]
[pandora:13903] [ 7] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f68edc4a799]
[pandora:13903] [ 8] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f68edc4a799]
[pandora:13903] [ 9] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_mpi_finalize+0x86f)[0x7f68ee7fcc1f]
[pandora:13903] [10] IMB-MPI1[0x4025d4]
[pandora:13903] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f68ee1dc3d5]
[pandora:13903] [12] IMB-MPI1[0x401d59]
[pandora:13903] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 8194 on node titan-ib exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
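If it would help, on the next run I could also try to grab a core dump from the failing rank and pull a backtrace out of it. A minimal sketch of what I have in mind (this assumes core dumps are allowed on the compute nodes; the binary path, core-file name, and PID below are placeholders, and the core limit has to be raised on each node itself, e.g. via limits.conf or the remote shell startup, since a ulimit in the launching shell does not reach the remote ranks):
# on each node, before launching, allow core files to be written
ulimit -c unlimited
# rerun the same mpirun command as above; the segfaulting rank should then leave a core file
# load it together with the binary and print the backtrace
gdb -batch -ex bt /path/to/IMB-MPI1 core.&lt;pid&gt;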
I was just notified that this issue could be threading-related and was told that it might be difficult to reproduce. I would like to get this resolved as soon as possible, since I am doing testing for the OpenFabrics Alliance at the UNH-IOL; I am trying to get all of their testing done, and this is the last thing that needs to be addressed. If you would like to VPN into our cluster to try to solve this issue faster, you can contact me at aleblanc@iol.unh.edu.
Thank you, and I hope to hear from you soon.
Hello @titanlock
Thank you for your interest in IMB, and sorry for the delay.
IMB does not have any option to produce more output, so you should ask the OpenMPI team about it.
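If it is of any use as a starting point, on the Open MPI side the extra detail usually comes from MCA verbosity parameters rather than from IMB. This is only a rough sketch; the exact parameter names and useful verbosity levels should be confirmed with ompi_info for your build, and 100 is just an example level:
mpirun --mca btl_base_verbose 100 --mca rcache_base_verbose 100 --mca btl openib,vader,self --mca pml ob1 -np 8 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
Something like "ompi_info --param btl openib --level 9" should list the parameters that are actually available in your installation.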