jll63/yomm2

error in runtime destructor

Closed this issue · 15 comments

Hi @jll63,

Thanks again for the library; I'm still experimenting with it. I'm seeing this error in the runtime destructor. Do you know what might cause it? (It does not reproduce in a debug build on Mac, and I haven't debugged it on the Linux server yet. I assume something goes wrong in the runtime struct in certain cases.)

*** Error in `./NodeServer': free(): corrupted unsorted chunks: 0x000000000155b170 ***
*** Aborted at 1563204608 (unix time) try "date -d @1563204608" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3bb20000227b) received by PID 8827 (TID 0x7f8a872cf080) from PID 8827; stack trace: ***
@ 0x7f8a86cb6330 (unknown)
@ 0x7f8a860dfc37 gsignal
@ 0x7f8a860e3028 abort
@ 0x7f8a8611c2a4 (unknown)
@ 0x7f8a8612882e (unknown)
@ 0x698710 __gnu_cxx::new_allocator<>::deallocate()
@ 0x695988 std::allocator_traits<>::deallocate()
@ 0x690eee std::_Vector_base<>::_M_deallocate()
@ 0x68b38f std::_Vector_base<>::~_Vector_base()
@ 0x687785 std::vector<>::~vector()
@ 0x687120 yorel::yomm2::detail::runtime::~runtime()
@ 0x681898 yorel::yomm2::detail::update_methods()
@ 0x6862f5 yorel::yomm2::update_methods()
@ 0x41ebdd RunServer()
@ 0x41abda main

call stack in GDB:
(gdb) bt
#0 0x00007ffff6df1c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff6df5028 in __GI_abort () at abort.c:89
#2 0x00007ffff6e2e2a4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff6f40350 "*** Error in `%s': %s: 0x%s \n")
at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff6e3a82e in malloc_printerr (ptr=<optimized out>, str=0x7ffff6f404a0 "free(): corrupted unsorted chunks", action=1) at malloc.c:4998
#4 _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3842
#5 0x00000000006b269a in __gnu_cxx::new_allocator<yorel::yomm2::detail::rt_method>::deallocate(yorel::yomm2::detail::rt_method*, unsigned long) ()
#6 0x00000000006b011c in std::allocator_traits<std::allocator<yorel::yomm2::detail::rt_method> >::deallocate(std::allocator<yorel::yomm2::detail::rt_method>&, yorel::yomm2::detail::rt_method*, unsigned long) ()
#7 0x00000000006abcec in std::_Vector_base<yorel::yomm2::detail::rt_method, std::allocator<yorel::yomm2::detail::rt_method> >::_M_deallocate(yorel::yomm2::detail::rt_method*, unsigned long) ()
#8 0x00000000006a6703 in std::_Vector_base<yorel::yomm2::detail::rt_method, std::allocator<yorel::yomm2::detail::rt_method> >::~_Vector_base() ()
#9 0x00000000006a2f8b in std::vector<yorel::yomm2::detail::rt_method, std::allocator<yorel::yomm2::detail::rt_method> >::~vector() ()
#10 0x00000000006a29c6 in yorel::yomm2::detail::runtime::~runtime() ()
#11 0x000000000069d306 in yorel::yomm2::detail::update_methods(yorel::yomm2::detail::registry const&, yorel::yomm2::detail::dispatch_data&) ()
#12 0x00000000006a1d63 in yorel::yomm2::update_methods() ()
#13 0x0000000000413622 in main (argc=1, argv=0x7fffffffe808) at /home/shawncao/nebula/src/service/node/NodeServer.cpp:178

jll63 commented

Thanks.

Can you run your program with env variable YOMM2_ENABLE_TRACE=1 and post the output please?

Also it would help if I had a skeleton of all the classes and methods. I don't need all the code, just the class declarations (class body can be left empty) and the corresponding calls to register_class, declare_method and define_method (again method body is not needed).

Thanks @jll63

This is the header file that defines all the register_class and define_method calls:
https://github.com/shawncao/nebula/blob/master/src/execution/serde/RowCursorSerde.h

This is where update_methods gets called, basically the main entry point: https://github.com/shawncao/nebula/blob/master/src/service/node/NodeServer.cpp#L159

An update on what I have found:
If I move the update_methods call into the NodeServerImpl constructor, it works fine without crashing.
Is it a namespace issue (since register_class/define_method are called inside namespace nebula::execution::serde)?

NodeServerImpl() {
  // We're using AOP lib yomm2 to inject batch serialization
  // Since we don't use dynamic library loading, we call this once at starting point.
  // TODO(cao) - crashes node server, need to figure out root cause before executing query
  yorel::yomm2::update_methods();
}

Yeah, it seems that if I make this call inside our namespace rather than in main(), it works fine. You may already know why...

A small update to the code at the second link above:
....
void updateOpenMethods() {
  yorel::yomm2::update_methods();
}

} // namespace service
} // namespace nebula

void RunServer() {
  // update_methods needs to be called inside our namespace, otherwise it will crash.
  nebula::service::updateOpenMethods();
  ...

jll63 commented

Well...this is very weird. The namespace should not matter in the least; in fact, update_methods is supposed to be called from main.

Can you try to put update_methods back where it was, but this time call it like this: ::yorel::yomm2::update_methods(); (note the :: at the beginning).
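A leading :: only changes anything if some enclosing scope declares its own namespace with the same name, so the suggestion above reads as a sanity check on name lookup. The sketch below shows the general C++ rule with hypothetical names (`lib`, `app`, `which()`); none of them come from YOMM2 or from the reporter's code, and the thread does not say whether any such shadowing exists there.

```cpp
// Hypothetical names throughout: `lib`, `app` and `which()` are illustrative only.
namespace lib {
inline int which() { return 1; }  // stands in for the library at global scope
}  // namespace lib

namespace app {
namespace lib {
inline int which() { return 2; }  // a nested namespace that happens to share the name
}  // namespace lib

// Lookup of `lib` starts in `app`, so app::lib is found first.
inline int call_relative() { return lib::which(); }

// A leading `::` forces lookup to start at the global namespace.
inline int call_global() { return ::lib::which(); }
}  // namespace app
```

Here `app::call_relative()` reaches the nested `app::lib`, while `app::call_global()` reaches the global `lib`. From a function defined at global scope, both spellings resolve to the same function.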

Does your code have a lot of dependencies? I may try to build it tonight.

Also, please do try YOMM2_ENABLE_TRACE=1; and it's also fun to look at ;-)

Sorry, this is the trace when it fails. I'm not sure the "namespace move" above really fixes the issue:

Register nebula::surface::RowCursor with &typeid 0xa00fd0
Register nebula::execution::core::BlockExecutor with &typeid 0xa01008
Register nebula::execution::core::SamplesExecutor with &typeid 0xa01050
Register nebula::memory::keyed::FlatRowCursor with &typeid 0xa010a8
Register nebula::surface::CompositeRowCursor with &typeid 0xa01138
Register nebula::surface::MockRowCursor with &typeid 0xa010e8
Register method asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema)
asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema): add spec (nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema): add spec (nebula::execution::core::BlockExecutor & b, nebula::type::Schema)
asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema): add spec (nebula::memory::keyed::FlatRowCursor & f, nebula::type::Schema)
Layering...
nebula::surface::RowCursor
nebula::execution::core::BlockExecutor nebula::execution::core::SamplesExecutor nebula::memory::keyed::FlatRowCursor nebula::surface::CompositeRowCursor nebula::surface::MockRowCursor
Allocating slots...
nebula::surface::RowCursor...
for asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema)#0: 0 also in
nebula::execution::core::BlockExecutor
nebula::execution::core::SamplesExecutor
nebula::memory::keyed::FlatRowCursor
nebula::surface::CompositeRowCursor
nebula::surface::MockRowCursor
Building dispatch table for asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema)
make groups for param #0, class nebula::surface::RowCursor
specs applicable to nebula::surface::MockRowCursor
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
bit mask = 001
specs applicable to nebula::surface::CompositeRowCursor
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
bit mask = 001
specs applicable to nebula::memory::keyed::FlatRowCursor
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
(nebula::memory::keyed::FlatRowCursor & f, nebula::type::Schema)
bit mask = 101
specs applicable to nebula::execution::core::SamplesExecutor
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
bit mask = 001
specs applicable to nebula::surface::RowCursor
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
bit mask = 001
specs applicable to nebula::execution::core::BlockExecutor
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
(nebula::execution::core::BlockExecutor & b, nebula::type::Schema)
bit mask = 011
groups for dim 0:
group 0/0 mask 001 nebula::surface::MockRowCursor nebula::surface::CompositeRowCursor nebula::execution::core::SamplesExecutor nebula::surface::RowCursor
group 0/1 mask 011 nebula::execution::core::BlockExecutor
group 0/2 mask 101 nebula::memory::keyed::FlatRowCursor
assign specs
group 0/0 mask 001 nebula::surface::MockRowCursor nebula::surface::CompositeRowCursor nebula::execution::core::SamplesExecutor nebula::surface::RowCursor
select best of:
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema): pf = 0x41cb70
group 0/1 mask 011 nebula::execution::core::BlockExecutor
select best of:
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
(nebula::execution::core::BlockExecutor & b, nebula::type::Schema)
(nebula::execution::core::BlockExecutor & b, nebula::type::Schema): pf = 0x41cb40
group 0/2 mask 101 nebula::memory::keyed::FlatRowCursor
select best of:
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
(nebula::memory::keyed::FlatRowCursor & f, nebula::type::Schema)
(nebula::memory::keyed::FlatRowCursor & f, nebula::type::Schema): pf = 0x41ec50
assign next
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema):
select best of:
-> none
(nebula::execution::core::BlockExecutor & b, nebula::type::Schema):
select best of:
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
-> (nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
(nebula::memory::keyed::FlatRowCursor & f, nebula::type::Schema):
select best of:
(nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
-> (nebula::surface::RowCursor & cursor, nebula::type::Schema schema)
Finding hash factor for 6 ti*
trying with M = 3, 8 buckets
found 1523255767835814935 after 5 attempts and 0.02152 msecs
Initializing global vector at 0xfd97f0
0 pointer to control table
1 hash table
9 control table
17 asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema)
17 mtbl for nebula::surface::RowCursor: 0xfd9878
18 mtbl for nebula::execution::core::BlockExecutor: 0xfd9880
19 mtbl for nebula::execution::core::SamplesExecutor: 0xfd9888
20 mtbl for nebula::memory::keyed::FlatRowCursor: 0xfd9890
21 mtbl for nebula::surface::CompositeRowCursor: 0xfd9898
22 mtbl for nebula::surface::MockRowCursor: 0xfd98a0
23 end
Optimizing
asBuffer(yorel::yomm2::virtual_<nebula::surface::RowCursor&>, nebula::type::Schema)
nebula::surface::MockRowCursor.mtbl[0] = 0x41cb70 (function)
nebula::surface::CompositeRowCursor.mtbl[0] = 0x41cb70 (function)
nebula::memory::keyed::FlatRowCursor.mtbl[0] = 0x41ec50 (function)
nebula::execution::core::SamplesExecutor.mtbl[0] = 0x41cb70 (function)
nebula::surface::RowCursor.mtbl[0] = 0x41cb70 (function)
nebula::execution::core::BlockExecutor.mtbl[0] = 0x41cb40 (function)
Finished
*** Aborted at 1563212180 (unix time) try "date -d @1563212180" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3bb200003e5f) received by PID 15967 (TID 0x7f1f77645080) from PID 15967; stack trace: ***
@ 0x7f1f7702c330 (unknown)
@ 0x7f1f76455c37 gsignal
@ 0x7f1f76459028 abort
@ 0x7f1f764922a4 (unknown)
@ 0x7f1f7649e82e (unknown)
@ 0x699cae (unknown)
@ 0x696f26 (unknown)
@ 0x69248c (unknown)
@ 0x68c92d (unknown)
@ 0x688d23 (unknown)
@ 0x6886be (unknown)
@ 0x682e78 (unknown)
@ 0x6878d5 (unknown)
@ 0x41f2fd (unknown)
@ 0x41a94a (unknown)
@ 0x7f1f76440f45 __libc_start_main
@ 0x41c0b8 (unknown)
@ 0x0 (unknown)
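One way to read the "make groups" part of the trace above: each spec gets one bit, a class's mask has a bit set for every spec whose virtual parameter type is that class or one of its bases, and classes with equal masks end up in the same group. The sketch below reproduces the masks 001, 011 and 101 from a hand-written class table. It is not YOMM2's implementation; the helper names (`ClassInfo`, `is_a`, `spec_mask`, `trace_classes`, `trace_specs`) are hypothetical, class names are shortened, and it assumes every class derives directly from RowCursor, as the "Layering..." lines suggest.

```cpp
#include <bitset>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative only: a hand-written class table, not YOMM2's internal data.
struct ClassInfo {
    std::string base;  // empty for the root; single inheritance assumed
};

// True if `cls` is `target` or derives from it, following `base` links.
inline bool is_a(const std::map<std::string, ClassInfo>& classes,
                 std::string cls, const std::string& target) {
    while (!cls.empty()) {
        if (cls == target) return true;
        cls = classes.at(cls).base;
    }
    return false;
}

// Bit i is set when spec i (named by its virtual parameter's class) accepts `cls`.
inline std::bitset<3> spec_mask(const std::map<std::string, ClassInfo>& classes,
                                const std::vector<std::string>& spec_params,
                                const std::string& cls) {
    std::bitset<3> mask;
    for (std::size_t i = 0; i < spec_params.size(); ++i)
        if (is_a(classes, cls, spec_params[i])) mask.set(i);
    return mask;
}

// Classes as registered in the trace (assumed to derive directly from RowCursor).
inline std::map<std::string, ClassInfo> trace_classes() {
    return {{"RowCursor", {""}},
            {"BlockExecutor", {"RowCursor"}},
            {"SamplesExecutor", {"RowCursor"}},
            {"FlatRowCursor", {"RowCursor"}},
            {"CompositeRowCursor", {"RowCursor"}},
            {"MockRowCursor", {"RowCursor"}}};
}

// Specs in the order the trace adds them: bit 0, bit 1, bit 2.
inline std::vector<std::string> trace_specs() {
    return {"RowCursor", "BlockExecutor", "FlatRowCursor"};
}
```

Printing a mask with `to_string()` puts the highest bit first, which matches the trace: FlatRowCursor accepts spec 0 and spec 2, hence 101.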

jll63 commented

I have to install some deps to navigate your code but I could not find RowCursor...

jll63 commented
$ cmake ..
-- Boost version: 1.65.1
-- Found the following Boost libraries:
--   program_options
--   regex
--   system
--   filesystem
--   context
--   thread
--   chrono
--   date_time
--   atomic
-- NEBULA_ROOT : /home/jleroy/dev/nebula
-- NEBULA_SRC : /home/jleroy/dev/nebula/src
Nebula Server: /home/jleroy/dev/nebula/build/NebulaServer
CMake Error at src/service/Service.cmake:76 (file):
  file problem touching file:
  /home/jleroy/dev/nebula/src/service/gen/nebula/node/node.grpc.fb.cc
Call Stack (most recent call first):
  CMakeLists.txt:149 (include)


-- Configuring incomplete, errors occurred!

Also:

$ ls -l /home/jleroy/dev/nebula/src/service/gen/nebula/node
ls: cannot access '/home/jleroy/dev/nebula/src/service/gen/nebula/node': No such file or directory

Some parts missing?

jll63 commented

OK I created a skeleton copy of your program, which gives me the exact same yomm2 trace. It doesn't crash. Everything looks as it should be. At this point I tend to think that it is not a yomm2 problem. I suggest that you try to return just after calling update_methods to see what happens. And maybe trace in the debugger.

I have the impression no threads (except main) exist yet when you call update_methods, correct?

Yes, no threads. It fails inside update_methods; I added the trace, and the first post has the call stack at the point of failure. Could it be related to compiler optimization? The same code, built twice, sometimes passes and sometimes fails, so my comments above about how to make it work are not correct. I'll debug it and post my findings.

jll63 commented

Any progress on this?

No, I haven't figured out why, but I updated the first post with the GDB crash stack. For now I'm using dynamic cast instead to unblock myself; I'd love to debug more when I get time.
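For readers wondering what such a stopgap might look like, here is a minimal sketch of dispatching on the dynamic type with dynamic_cast instead of an open method. The class names mirror the thread, but the empty class bodies, the simplified `asBuffer` signature (no Schema argument) and the string return values are assumptions made for illustration, not the reporter's actual code.

```cpp
#include <string>

// Class names mirror the thread; bodies and return values are hypothetical.
struct RowCursor {
    virtual ~RowCursor() = default;  // polymorphic base, required for dynamic_cast
};
struct BlockExecutor : RowCursor {};
struct FlatRowCursor : RowCursor {};
struct MockRowCursor : RowCursor {};

// Stand-in for the open method: test the specialized types first,
// then fall back to the generic RowCursor handling.
inline std::string asBuffer(RowCursor& cursor) {
    if (dynamic_cast<BlockExecutor*>(&cursor)) return "block";
    if (dynamic_cast<FlatRowCursor*>(&cursor)) return "flat";
    return "generic";
}
```

The chain of casts has to be kept in sync by hand whenever a new specialization is added, which is the bookkeeping the dispatch table in the trace does automatically.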

jll63 commented

Got it. There was a bug in the last commit. Fixed now. Can you pull and check on your side?

Hmm, interesting: it looks like it works now with the latest pull. I'll keep it running for a few days and report back on whether this really fixes it.

jll63 commented

Hi, any more problems?

The latest fix seems to address the root cause; I haven't seen any other issues in the past week since pulling it. Thanks for the fix, @jll63!