MatMechLab/AsFem

PETSc error in branch devel


[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[1]PETSC ERROR: likely location of problem given in stack below
[1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[1]PETSC ERROR: No error traceback is available, the problem could be in the main program.
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: Signal received
[1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1]PETSC ERROR: Petsc Release Version 3.17.2, Jun 02, 2022
[1]PETSC ERROR: /thfs1/home/liujinmei/AsFem/bin/asfem on a  named cn537 by liujinmei Wed Oct 26 11:44:02 2022
[1]PETSC ERROR: [2]PETSC ERROR: ------------------------------------------------------------------------
[2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[2]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[2]PETSC ERROR: likely location of problem given in stack below
[2]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[2]PETSC ERROR: No error traceback is available, the problem could be in the main program.
[2]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------

@yangbai90 I think this is a memory-related problem.

I modified CMAKE_CXX_FLAGS in CMakeLists.txt: I added -fsanitize=undefined,address, changed -O2 to -Og, and removed -Werror.
I ran find / -name "libasan.so" to get the location of libasan.so, then preloaded it when running asfem:

LD_PRELOAD=/usr/lib/gcc/x86_64-linux-gnu/5/libasan.so ./asfem --version

The result is as follows:

liujinmei@ln0:~/AsFem/bin$ LD_PRELOAD=/thfs1/home/liujinmei/software/gcc/gcc12.2/lib64/libasan.so ./asfem --version
Abort(1090831) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(159):
MPID_Init(509).......:
MPIR_pmi_init(91)....: PMIX_Init returned -25
[ln0:800040:0:800040] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 800040) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_debug_print_backtrace+0x1c) [0x40003fd9712c]
 1  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x250) [0x40003fd993d0]
 2  /usr/local/ucx/lib/libucs.so.0(+0x26530) [0x40003fd99530]
 3  /usr/local/ucx/lib/libucs.so.0(+0x268c0) [0x40003fd998c0]
 4  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0x40003bf305b8]
 5  /thfs1/software/mpich/mpi-x-gcc9.3.0/lib/libmpi.so.12(MPIR_Err_return_comm+0x78) [0x40003c9adaa8]
 6  /thfs1/home/liujinmei/software/petsc/3.17.2/lib/libpetsc.so.3.17(PetscInitialize+0x19c) [0x40003d511ebc]
 7  ./asfem() [0x40acec]
 8  /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe8) [0x40003fa00090]
 9  ./asfem() [0x409e84]
=================================
AddressSanitizer:DEADLYSIGNAL
=================================================================
==800040==ERROR: AddressSanitizer: SEGV on unknown address 0x2a47000c3528 (pc 0x40003c9adaa8 bp 0xfffff6bbdf40 sp 0xfffff6bbdf40 T0)
==800040==The signal is caused by a READ memory access.
    #0 0x40003c9adaa8 in MPIR_Err_return_comm (/thfs1/software/mpich/mpi-x-gcc9.3.0/lib/libmpi.so.12+0x2e7aa8)
    #1 0x40003d511eb8 in PetscInitialize (/thfs1/home/liujinmei/software/petsc/3.17.2/lib/libpetsc.so.3.17+0x1d2eb8)
    #2 0x40ace8 in main ../src/main.cpp:24
    #3 0x40003fa0008c in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x2408c)
    #4 0x409e80  (/thfs1/home/liujinmei/AsFem/bin/asfem+0x409e80)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/thfs1/software/mpich/mpi-x-gcc9.3.0/lib/libmpi.so.12+0x2e7aa8) in MPIR_Err_return_comm
==800040==ABORTING
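
As an alternative to the find / -name "libasan.so" search described above, the compiler itself can usually report where its libasan.so lives. The following is a minimal sketch, assuming a GCC-compatible driver is on PATH; find_libasan is a hypothetical helper name, not part of AsFem or PETSc. It relies on the -print-file-name option, which prints the bare file name back when the library is not found.

```python
import os
import shutil
import subprocess


def find_libasan(compiler="gcc"):
    """Return the absolute path of libasan.so reported by `compiler`, or None.

    Hypothetical helper for illustration only. It asks the compiler driver
    with -print-file-name instead of scanning the whole filesystem.
    """
    if shutil.which(compiler) is None:
        # The compiler is not installed or not on PATH.
        return None
    result = subprocess.run(
        [compiler, "-print-file-name=libasan.so"],
        capture_output=True,
        text=True,
        check=False,
    )
    path = result.stdout.strip()
    # When the library is unknown, the driver echoes the bare name back,
    # so only accept an absolute path that really exists.
    if os.path.isabs(path) and os.path.exists(path):
        return path
    return None


if __name__ == "__main__":
    path = find_libasan()
    if path is None:
        print("libasan.so not found via the compiler driver")
    else:
        # Print the same kind of command line used in this issue.
        print(f"LD_PRELOAD={path} ./asfem --version")
```

Using the compiler that actually built asfem (here gcc 12.2) as the argument helps ensure the preloaded libasan.so matches the sanitizer runtime the binary was linked against.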

cpu: aarch64
system: ubuntu
compiler: gcc12.2
mpi: mpich/mpi-x-gcc9.3.0
petsc: 3.17.2
This is the environment in which the issue was found.

The cause of the issue may have been found: the MPI library and PETSc must be compiled with the same compiler version.
This issue has been solved. Thanks for your help!
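
The mismatch described above can be spotted from the environment labels alone. The sketch below is only illustrative: it assumes the MPI module name mpich/mpi-x-gcc9.3.0 encodes the GCC version that MPI was built with (an assumption, not confirmed in this thread), and gcc_version and same_compiler are hypothetical helper names.

```python
import re


def gcc_version(label):
    """Parse (major, minor) from a label such as 'gcc12.2' or
    'mpich/mpi-x-gcc9.3.0'. Returns None if no GCC version is present.
    Hypothetical helper for illustration only."""
    match = re.search(r"gcc[-_ ]?(\d+)(?:\.(\d+))?", label)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2) or 0)


def same_compiler(label_a, label_b):
    """True only if both labels carry the same GCC major.minor version."""
    version_a = gcc_version(label_a)
    version_b = gcc_version(label_b)
    return version_a is not None and version_a == version_b


# Labels copied from the environment listed in this issue.
mpi_label = "mpich/mpi-x-gcc9.3.0"   # assumed to mean "built with gcc 9.3.0"
app_label = "gcc12.2"                # compiler used for asfem

print(gcc_version(mpi_label), gcc_version(app_label),
      same_compiler(mpi_label, app_label))
# prints: (9, 3) (12, 2) False
```

A label comparison like this is only a first hint. The more reliable check is to ask the MPI compiler wrapper which underlying compiler it invokes (MPICH wrappers accept -show for this) and compare that with the compiler used to build PETSc and asfem.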