darwin builds break on interrupt test with -O2
For some reason we need to set osx builds to -O0, otherwise we get a bus error: 10. We need to get on an osx machine, try this out with debug symbols, and have gdb spit out info on where it's happening. Because everything is wrapped in libtool, and because libtool on osx isn't the same thing, we need to disable shared libs so we can just use gdb directly.
I can confirm that test/interrupt crashes under -O2, but not under -O0, on macOS.
Here's the stack trace for this bus error:
(lldb) run
Process 23421 launched: './test/.libs/interrupt' (x86_64)
=== Testing interrupt ===
test_early [PASS]
test_loopProcess 23421 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x7000051c8d00)
frame #0: 0x00007000051c8d00
-> 0x7000051c8d00: rorb 0x7000051c(%rsi)
0x7000051c8d06: addb %al, (%rax)
0x7000051c8d08: xchgl %edi, %eax
0x7000051c8d09: movsb (%rsi), %es:(%rdi)
Target 0: (interrupt) stopped.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=2, address=0x7000051c8d00)
* frame #0: 0x00007000051c8d00
frame #1: 0x00000001000db063 libprime_server.0.dylib`std::__1::function<prime_server::worker_t::result_t (std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&)>::operator(this=<unavailable>, __arg=<unavailable>, __arg=0x0000000000000001, __arg=<unavailable>)(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&) const at functional:1921 [opt]
frame #2: 0x00000001000db063 libprime_server.0.dylib`std::__1::function<prime_server::worker_t::result_t (std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&)>::operator(this=<unavailable>, __arg=<unavailable>, __arg=0x00000001004048a8, __arg=<unavailable>)(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&) const at functional:1921 [opt]
frame #3: 0x00000001000da497 libprime_server.0.dylib`prime_server::worker_t::work(this=0x000000010030f0b0) at prime_server.cpp:407 [opt]
frame #4: 0x00000001000077fc interrupt`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> > >(void*) [inlined] decltype(__f=<unavailable>)::testable_worker_t&>(fp0).*fp(std::__1::forward<>(fp1))) std::__1::__invoke<void (prime_server::worker_t::*&)(), (anonymous namespace)::testable_worker_t&, void>(void (prime_server::worker_t::*&&&)(), (anonymous namespace)::testable_worker_t&&&) at type_traits:4236 [opt]
frame #5: 0x00000001000077e1 interrupt`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> > >(void*) [inlined] std::__1::__bind_return<void (prime_server::worker_t::*)(), std::__1::tuple<(anonymous namespace)::testable_worker_t>, std::__1::tuple<>, __is_valid_bind_return<void (prime_server::worker_t::*)(), std::__1::tuple<(anonymous namespace)::testable_worker_t>, std::__1::tuple<> >::value>::type std::__1::__apply_functor<void (prime_server::worker_t::*)(), std::__1::tuple<(anonymous namespace)::testable_worker_t>, 0ul, std::__1::tuple<> >(void (prime_server::worker_t::*&)(), std::__1::tuple<(anonymous namespace)::testable_worker_t>&, std::__1::__tuple_indices<0ul>, std::__1::tuple<>&&) at functional:2224 [opt]
frame #6: 0x00000001000077e1 interrupt`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> > >(void*) [inlined] std::__1::__bind_return<void (prime_server::worker_t::*)(), std::__1::tuple<(anonymous namespace)::testable_worker_t>, std::__1::tuple<>, __is_valid_bind_return<void (prime_server::worker_t::*)(), std::__1::tuple<(anonymous namespace)::testable_worker_t>, std::__1::tuple<> >::value>::type std::__1::__bind<void (this=<unavailable>)(), (anonymous namespace)::testable_worker_t>::operator()<>() at functional:2257 [opt]
frame #7: 0x00000001000077e1 interrupt`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> > >(void*) [inlined] decltype(__f=<unavailable>)(), (anonymous namespace)::testable_worker_t> >(fp)(std::__1::forward<>(fp0))) std::__1::__invoke<std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> >(std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t>&&) at type_traits:4323 [opt]
frame #8: 0x00000001000077e1 interrupt`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> > >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> >(__t=<unavailable>)(), (anonymous namespace)::testable_worker_t> >&, std::__1::__tuple_indices<>) at thread:342 [opt]
frame #9: 0x00000001000077e1 interrupt`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (prime_server::worker_t::*)(), (anonymous namespace)::testable_worker_t> > >(__vp=0x000000010030f090) at thread:352 [opt]
frame #10: 0x00007fff6a9f9661 libsystem_pthread.dylib`_pthread_body + 340
frame #11: 0x00007fff6a9f950d libsystem_pthread.dylib`_pthread_start + 377
frame #12: 0x00007fff6a9f8bf9 libsystem_pthread.dylib`thread_start + 13
And the prime_server.cpp:407 frame looks like this (unfortunately missing some interesting stuff):
(lldb) up
frame #3: 0x00000001000da497 libprime_server.0.dylib`prime_server::worker_t::work(this=0x0000000100305d90) at prime_server.cpp:407 [opt]
404 job = *static_cast<const uint64_t*>(request_info.data());
405 handle_interrupt(true);
406 //do the work
-> 407 auto result = work_function(messages, request_info.data(), bail);
408 //we'll keep advertising with this heartbeat
409 heart_beat = std::move(result.heart_beat);
410 //should we send this on to the next proxy
(lldb) fr v
error: libprime_server_la-prime_server.o DWARF DIE at 0x000006fd (class worker_t) has a member variable 0x00000760 (cleanup_function) whose type is a forward declaration, not a complete definition.
Try compiling the source file with -fno-limit-debug-info
((anonymous namespace)::testable_worker_t *) this = 0x0000000100305d90
(prime_server::worker_t::interrupt_function_t) bail = 0x0000000100603250
(zmq::pollitem_t [2]) items = {
[0] = (socket = 0x0000000101808c00, fd = 0, events = 1, revents = 1)
[1] = (socket = 0x0000000100807c00, fd = 0, events = 1, revents = 0)
}
(const std::exception &) e = <variable not available>
(const (anonymous namespace)::interrupt_t &) i = <variable not available>
(const std::exception &) e = <variable not available>
(zmq::message_t) request_info = {
ptr = std::__1::shared_ptr<zmq_msg_t>::element_type @ 0x000000010030bee0 strong=1 weak=1 {
__ptr_ = 0x000000010030bee0
}
}
(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> >) messages = <variable not available>
(prime_server::worker_t::result_t) result = <variable not available>
I'm not sure what's up with this message:
error: libprime_server_la-prime_server.o DWARF DIE at 0x000006fd (class worker_t) has a member variable 0x00000760 (cleanup_function) whose type is a forward declaration, not a complete definition.
I tried adding -fno-limit-debug-info, but it had no effect.
@danpat thanks for the info! I don't really know anything about lldb, but the error you are showing here makes me think that the compiler doesn't see that the cleanup function is defined!? The cleanup function is given the default value of [](){}, which is a lambda that takes no arguments and returns void. One quick check I can try, to see what CI says, is to make that more explicit. I'll give this a go when I have a free moment.
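For reference, here is a minimal sketch of what "more explicit" could mean. It is not the prime_server code: `sketch_worker_t`, `make_explicit` and `run_cleanup` are hypothetical names, and only the cleanup callback is modeled. It contrasts relying on the implicit lambda-to-`std::function` conversion in a default argument with spelling the `std::function` out by hand.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical, trimmed-down worker: only the cleanup callback is modeled.
struct sketch_worker_t {
  using cleanup_function_t = std::function<void()>;

  // Implicit form: the empty lambda is converted to cleanup_function_t
  // when the default argument is used.
  explicit sketch_worker_t(cleanup_function_t cleanup = []() {})
      : cleanup_function(std::move(cleanup)) {}

  cleanup_function_t cleanup_function;
};

// Explicit form: build the std::function by hand before passing it in.
sketch_worker_t make_explicit() {
  return sketch_worker_t(sketch_worker_t::cleanup_function_t([]() {}));
}

// Calls the cleanup callback after checking that it actually holds a target.
void run_cleanup(const sketch_worker_t& w) {
  assert(static_cast<bool>(w.cleanup_function));
  w.cleanup_function();
}
```

Both forms should leave `cleanup_function` holding a callable target, so if the two behave differently on CI that would point at the toolchain rather than the code.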
So this bug is still floating around... seems to be something quite odd. In looking around for stop reason = EXC_BAD_INSTRUCTION I found some information suggesting that there is some undefined behavior going on and that the ud2 instruction is a trap the compiler puts in to signal that. So I turned on sanitizers and I see this:
(lldb) run
Process 2897 launched: '/Users/distiller/project/prime_server/build/interrupt' (x86_64)
=== Testing interrupt ===
test_early [PASS]
Process 2897 stopped
* thread #2, stop reason = Insufficient object size
frame #0: 0x00000001007d13d0 libclang_rt.asan_osx_dynamic.dylib`__ubsan_on_report
libclang_rt.asan_osx_dynamic.dylib`__ubsan_on_report:
-> 0x1007d13d0 <+0>: pushq %rbp
0x1007d13d1 <+1>: movq %rsp, %rbp
0x1007d13d4 <+4>: popq %rbp
0x1007d13d5 <+5>: retq
Target 0: (interrupt) stopped.
(lldb) bt
* thread #2, stop reason = Insufficient object size
* frame #0: 0x00000001007d13d0 libclang_rt.asan_osx_dynamic.dylib`__ubsan_on_report
frame #1: 0x00000001007cb8b9 libclang_rt.asan_osx_dynamic.dylib`__ubsan::Diag::~Diag() + 217
frame #2: 0x00000001007cd0f7 libclang_rt.asan_osx_dynamic.dylib`handleTypeMismatchImpl(__ubsan::TypeMismatchData*, unsigned long, __ubsan::ReportOptions) + 1255
frame #3: 0x00000001007ccbfc libclang_rt.asan_osx_dynamic.dylib`__ubsan_handle_type_mismatch_v1 + 60
frame #4: 0x0000000100035d39 interrupt`decltype(__f=<unavailable>, __args=size=1, __args=0x000070000661a420, __args=0x000070000661a500)::test_early()::$_0&>(fp)(std::__1::forward<std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&>(fp0), std::__1::forward<void*>(fp0), std::__1::forward<std::__1::function<void ()>&>(fp0))) std::__1::__invoke<(anonymous namespace)::test_early()::$_0&, std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&>((anonymous namespace)::test_early()::$_0&, std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*&&, std::__1::function<void ()>&) at type_traits:4361:1 [opt]
frame #5: 0x0000000100035c6c interrupt`prime_server::worker_t::result_t std::__1::__invoke_void_return_wrapper<prime_server::worker_t::result_t>::__call<(anonymous namespace)::test_early(__args=0x00006130000010b8, __args=size=1, __args=0x000070000661a420, __args=0x000070000661a500)::$_0&, std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&>((anonymous namespace)::test_early()::$_0&, std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*&&, std::__1::function<void ()>&) at __functional_base:318:16 [opt]
frame #6: 0x0000000100035ae3 interrupt`std::__1::__function::__alloc_func<(anonymous namespace)::test_early()::$_0, std::__1::allocator<(anonymous namespace)::test_early()::$_0>, prime_server::worker_t::result_t (std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&)>::operator(this=0x00006130000010b8, __arg=size=1, __arg=0x000070000661a420, __arg=0x000070000661a500)(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*&&, std::__1::function<void ()>&) at functional:1527:16 [opt]
frame #7: 0x0000000100035651 interrupt`std::__1::__function::__func<(anonymous namespace)::test_early()::$_0, std::__1::allocator<(anonymous namespace)::test_early()::$_0>, prime_server::worker_t::result_t (std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&)>::operator(this=0x00006130000010b8, __arg=size=1, __arg=0x000070000661a420, __arg=0x000070000661a500)(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*&&, std::__1::function<void ()>&) at functional:1651:12 [opt]
frame #8: 0x00000001002ba8d8 libprime_server.0.dylib`std::__1::function<prime_server::worker_t::result_t (std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&)>::operator()(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&) const [inlined] std::__1::__function::__value_func<prime_server::worker_t::result_t (std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&)>::operator(this=<unavailable>, __args=size=1, __args=<unavailable>, __args=0x000070000661a500)(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*&&, std::__1::function<void ()>&) const at functional:1799:16 [opt]
frame #9: 0x00000001002ba89d libprime_server.0.dylib`std::__1::function<prime_server::worker_t::result_t (std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&)>::operator(this=<unavailable>, __arg=size=1, __arg=<unavailable>, __arg=0x000070000661a500)(std::__1::list<zmq::message_t, std::__1::allocator<zmq::message_t> > const&, void*, std::__1::function<void ()>&) const at functional:2347 [opt]
frame #10: 0x00000001002b86f0 libprime_server.0.dylib`prime_server::worker_t::work(this=<unavailable>) at prime_server.cpp:517:23 [opt]
So it's still stopping at the same place, the work function, but it seems to be 2 instructions before the "trap" the compiler adds in non-sanitizer mode. To me it looks like it's got something to do with std::bind on the member function for handling interrupts. Since it binds to the member function but the detached thread that owns the worker is still referenced in the work function, maybe there is some kind of double free on that bound interrupt member function or something...
Yep, that was it. I was instantiating a lambda in the tests which was just a worker method that was an endless loop. This meant that the detached thread, which runs until the program terminates, keeps wanting to call that lambda. But the lambda's scope is the function that starts the thread. When that function finishes, the lambda goes out of scope and boom, the detached thread tries to call it and blows up. I have this fixed... PR coming.
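To illustrate the lifetime problem, here is a minimal sketch of one way to keep a callable alive for a detached thread. It is not the actual fix from the PR: `toy_worker_t`, `running_t`, `start_detached_worker` and `run_briefly` are hypothetical names, and the sketch assumes the failure came from a detached thread outliving a worker or lambda that was local to the function that started it. Here the worker owns a copy of the callable, and the thread holds a `shared_ptr` to the worker, so nothing it touches lives on the starting function's stack.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical stand-in for a worker that keeps calling a user-supplied function.
struct toy_worker_t {
  using work_function_t = std::function<void()>;
  explicit toy_worker_t(work_function_t f) : work_function(std::move(f)) {}

  // Loops until stop is set, then signals that the loop has exited.
  void work() {
    while (!stop.load()) {
      work_function();
    }
    finished.store(true);
  }

  work_function_t work_function;  // owned copy, not a reference to a caller's local
  std::atomic<bool> stop{false};
  std::atomic<bool> finished{false};
};

// Handles the caller keeps so it can observe and stop the worker.
struct running_t {
  std::shared_ptr<toy_worker_t> worker;
  std::shared_ptr<std::atomic<int>> calls;
};

running_t start_detached_worker() {
  auto calls = std::make_shared<std::atomic<int>>(0);
  // The lambda captures shared state by value, so it stays valid after this
  // function returns. Capturing a local by reference here would dangle.
  auto worker = std::make_shared<toy_worker_t>([calls]() {
    ++*calls;
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  });
  // The thread holds its own shared_ptr, so the worker outlives this scope.
  std::thread([worker]() { worker->work(); }).detach();
  return {worker, calls};
}

// Starts the worker, lets it run a few iterations, then stops it cleanly.
int run_briefly() {
  running_t r = start_detached_worker();
  while (r.calls->load() < 3) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  r.worker->stop.store(true);
  while (!r.worker->finished.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return r.calls->load();
}
```

Calling `run_briefly()` returns once the detached loop has exited, with a call count of at least 3. The broken variant would declare the worker (or the state its lambda uses) as a plain local in `start_detached_worker` and hand the thread a reference to it, which is the kind of use-after-scope the sanitizer run above flagged.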