bombela/backward-cpp

Non-deterministic segfault when enabling BACKWARD_HAS_DW and BACKWARD_HAS_BACKTRACE

eddy16112 opened this issue · 8 comments

I am using gcc 9.4 on linux with both BACKWARD_HAS_DW=1 and BACKWARD_HAS_BACKTRACE=1, and use single thread to call backward API, and then I occasionally catch a segfault from libdw. Here is the backtrace from gdb

Thread 7 "partitioning" received signal SIGSEGV, Segmentation fault.
0x00007ffff695f118 in __libdw_alloc_tail () from /scratch2/wwu/anaconda3/envs/py39/lib/libdw.so.1
(gdb) bt
#0  0x00007ffff695f118 in __libdw_alloc_tail () from /scratch2/wwu/anaconda3/envs/py39/lib/libdw.so.1
#1  0x00007ffff69569cd in read_srclines () from /scratch2/wwu/anaconda3/envs/py39/lib/libdw.so.1
#2  0x00007ffff6957d66 in __libdw_getsrclines () from /scratch2/wwu/anaconda3/envs/py39/lib/libdw.so.1
#3  0x00007ffff6957fff in dwarf_getsrclines () from /scratch2/wwu/anaconda3/envs/py39/lib/libdw.so.1
#4  0x00007ffff6958167 in dwarf_getsrc_die () from /scratch2/wwu/anaconda3/envs/py39/lib/libdw.so.1
#5  0x000055555707af65 in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::resolve (this=0x55555bd5b820, trace=...) at /scratch2/wwu/legion-master/runtime/realm/backward.hpp:1908
#6  0x0000555557075760 in Realm::RuntimeImpl::realm_backtrace (signal=6) at /scratch2/wwu/legion-master/runtime/realm/runtime_impl.cc:2934

Any idea about the segfault?

The value from gdb seems to be valid:

(gdb) p cudie
$1 = (Dwarf_Die *) 0x7fffdc184c20
(gdb) p *cudie
$2 = {addr = 0x7fffca6665df, cu = 0x7fffdc6bd6b0, abbrev = 0x7fffdc6ccfd0, padding__ = 0}
(gdb) p/x trace_addr
$3 = 0x5555570755ee
(gdb) p/x mod_bias
$4 = 0x555555554000

Maybe try under valgrind. I wouldn't rule out a bug in libdw. You can also try a different backend, like libbfd to compare.

I tried DBACKWARD_HAS_BACKTRACE_SYMBOL , and it works. I can not get libbfd compiled, because it needs libiberty-dev, which can not be retrieved from conda.
The valgrind is a little bit tricky, because my program is a GASNet program and valgrind complains about GASNet and abort.

Get another backtrace, also segfault:

#6  0x00007f450f40c92a in dwarf_highpc () from /lib/x86_64-linux-gnu/libdw.so.1
#7  0x00007f450f40f868 in dwarf_ranges () from /lib/x86_64-linux-gnu/libdw.so.1
#8  0x000055f99ee3f410 in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::die_has_pc (die=0x55f9a45555e0, pc=28448202) at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2000
#9  0x000055f99ee4180b in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::deep_first_search_by_pc<backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::inliners_search_cb> (parent_die=0x55f9a4555670, pc=28448202, cb=...)
    at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2065
#10 0x000055f99ee417ea in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::deep_first_search_by_pc<backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::inliners_search_cb> (parent_die=0x55f9a4555700, pc=28448202, cb=...)
    at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2062
#11 0x000055f99ee417ea in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::deep_first_search_by_pc<backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::inliners_search_cb> (parent_die=0x7f44f89e81c0, pc=28448202, cb=...)
    at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2062
#12 0x000055f99ee3f022 in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::resolve (this=0x55f9a45558a0, trace=...) at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:1922

Another issue, BACKWARD_HAS_BACKTRACE and BACKWARD_HAS_UNWIND generate different backtrace.
BACKWARD_HAS_BACKTRACE (wrong, see frame 6)

#16   Source "../sysdeps/unix/sysv/linux/x86_64/clone.S", line 97, in __clone
#15   Source "/build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c", line 478, in start_thread
#14   Source "/scratch2/wwu/legion-master/runtime/realm/threads.cc", line 784, in pthread_entry
#13   Source "/scratch2/wwu/legion-master/runtime/realm/threads.inl", line 98, in thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock>
#12   Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 1216, in scheduler_loop_wlock
#11   Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 1105, in scheduler_loop
#10   Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 1367, in execute_task
#9    Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 304, in execute_on_processor
#8    Source "/scratch2/wwu/legion-master/runtime/realm/proc_impl.cc", line 1129, in execute_task
#7    Source "/scratch2/wwu/legion-master/runtime/legion/legion.inl", line 21167, in legion_task_wrapper<daxpy_task>
#6    Source "/scratch2/wwu/legion-master/tutorial/07_partitioning/partitioning.cc", line 278, in check_task
#5    Object "/usr/lib/x86_64-linux-gnu/libc-2.31.so", at 0x7f714c1a8fd6, in 
#4  | Source "/build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c", line 970, in get_sysdep_segment_value
      Source "/build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c", line 509, in _nl_load_domain
#3    Source "/build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c", line 81, in abort
#2    Source "../sysdeps/unix/sysv/linux/raise.c", line 51, in raise
#1    Object "/usr/lib/x86_64-linux-gnu/libpthread-2.31.so", at 0x7f714c6eb420, in __restore_rt
#0    Source "/scratch2/wwu/legion-master/runtime/realm/runtime_impl.cc", line 2876, in realm_backtrace
(py39) wwu@g0004:/scratch2/wwu/legion-master/tutorial/07_partitioning$ ./partitioning -ll:cpu 1 -ll:force_kthreads

BACKWARD_HAS_UNWIND (correct, see frame 8)

#19   Object "", at 0xffffffffffffffff, in 
#18   Source "../sysdeps/unix/sysv/linux/x86_64/clone.S", line 95, in __clone
#17   Source "/build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c", line 477, in start_thread
#16   Source "/scratch2/wwu/legion-master/runtime/realm/threads.cc", line 781, in pthread_entry
#15   Source "/scratch2/wwu/legion-master/runtime/realm/threads.inl", line 97, in thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock>
#14   Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 1217, in scheduler_loop_wlock
#13   Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 1105, in scheduler_loop
#12   Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 1366, in execute_task
#11   Source "/scratch2/wwu/legion-master/runtime/realm/tasks.cc", line 302, in execute_on_processor
#10   Source "/scratch2/wwu/legion-master/runtime/realm/proc_impl.cc", line 1129, in execute_task
#9    Source "/scratch2/wwu/legion-master/runtime/legion/legion.inl", line 21165, in legion_task_wrapper<daxpy_task>
#8    Source "/scratch2/wwu/legion-master/tutorial/07_partitioning/partitioning.cc", line 256, in daxpy_task
#7    Source "/build/glibc-SzIz7B/glibc-2.31/assert/assert.c", line 101, in __assert_fail
#6    Source "/build/glibc-SzIz7B/glibc-2.31/assert/assert.c", line 92, in __assert_fail_base
#5    Source "/build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c", line 79, in abort
#4    Source "../sysdeps/unix/sysv/linux/raise.c", line 51, in raise
#3    Object "/usr/lib/x86_64-linux-gnu/libpthread-2.31.so", at 0x7f10d256341f, in 
#2    Source "/scratch2/wwu/legion-master/runtime/realm/runtime_impl.cc", line 2875, in realm_backtrace
#1    Source "/scratch2/wwu/legion-master/runtime/realm/new_backward.hpp", line 878, in load_here
#0    Source "/scratch2/wwu/legion-master/runtime/realm/new_backward.hpp", line 860, in unwind<backward::StackTraceImpl<backward::system_tag::linux_tag>::callback>

Interesting that Valgrind complains about GASNet. I only learned about GASNet now, and I do not understand what it does that upsets Valgrind.

Note that there is two level of dependencies in backward-cpp. Stack walking, and trace resolving.

Walking the stack find out which functions was traversed. You get the object file and memory address of the machine code. This is what you select with BACKWARD_HAS_UNWIND, BACKWARD_HAS_LIBUNWIND and BACKWARD_HAS_BACKTRACE (see https://github.com/bombela/backward-cpp/blob/master/backward.hpp#L98).

The second level is one of the library to read the debug symbols from the object files. Which is used to map object file to source file and line numbers.

In your two traces, unwind should give you the correct stack walking as it comes from the compiler itself.

Get another backtrace, also segfault:

#6  0x00007f450f40c92a in dwarf_highpc () from /lib/x86_64-linux-gnu/libdw.so.1
#7  0x00007f450f40f868 in dwarf_ranges () from /lib/x86_64-linux-gnu/libdw.so.1
#8  0x000055f99ee3f410 in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::die_has_pc (die=0x55f9a45555e0, pc=28448202) at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2000
#9  0x000055f99ee4180b in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::deep_first_search_by_pc<backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::inliners_search_cb> (parent_die=0x55f9a4555670, pc=28448202, cb=...)
    at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2065
#10 0x000055f99ee417ea in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::deep_first_search_by_pc<backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::inliners_search_cb> (parent_die=0x55f9a4555700, pc=28448202, cb=...)
    at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2062
#11 0x000055f99ee417ea in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::deep_first_search_by_pc<backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::inliners_search_cb> (parent_die=0x7f44f89e81c0, pc=28448202, cb=...)
    at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:2062
#12 0x000055f99ee3f022 in backward::TraceResolverLinuxImpl<backward::trace_resolver_tag::libdw>::resolve (this=0x55f9a45558a0, trace=...) at /scratch2/wwu/legion-master/runtime/realm/new_backward.hpp:1922

Sometimes your traces are sorted in a different order. Is this intended?
Where are the frames 0 to 5? The segfault should be at frame 0 was it inside the dwarf library?

I figured out the problem. The stack allocated for handling the signal is too small, increasing the stack fixed the problem