[rclcpp] C++ lib leaking `rclcpp::Node()` can cause segfaults at `dlclose()`
EricCousineau-TRI opened this issue
Bug report
Required Info:
- Operating System: Ubuntu 22.04
- Installation type: binaries
- Version or commit hash: humble
- DDS implementation: `rmw_cyclonedds_cpp`
- Client library (if applicable): `rclcpp`
Steps to reproduce issue
See https://github.com/EricCousineau-TRI/rclcpp_dlclose_segfault
Expected behavior
`dlclose()` does not trigger race conditions with threads -- ideally, it shuts down existing workers, or tries to order `atexit` operations manually.
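For illustration, a hypothetical sketch (not from the repro repo) of what ordering the teardown manually could look like on the caller's side; that `rclcpp::shutdown()` stops all worker threads is an assumption here, not something this issue verifies:

```cpp
// Hypothetical sketch: tear rclcpp down explicitly before the library
// is unloaded, rather than relying on atexit-time dlclose(). Assumes
// (unverified) that rclcpp::shutdown() stops the background threads.
#include <rclcpp/rclcpp.hpp>

extern "C" void ShutdownBeforeUnload() {
  if (rclcpp::ok()) {
    rclcpp::shutdown();
  }
}
```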
Actual behavior
`dlclose()` appears to trigger race conditions when the user leaks an `rclcpp::Node`.
Additional information
See README in https://github.com/EricCousineau-TRI/rclcpp_dlclose_segfault
Inline copy of README:
Relates to Drake, drake-ros, and pydrake.
Due to a workaround in pydrake (RobotLocomotion/drake#14356), we may not always release certain objects. (Non-ideal, but it will take some effort to fix.) If we have components that use `rclcpp::Node` but are never deleted, we can encounter segfaults.
Example minimal code (from `src/ros_segfault_min_lib.cc`) capturing the core issue:

```cpp
// Expected case: rclcpp::Node will be freed.
void InitNode() {
  RclcppInit();
  rclcpp::Node("node");
}

// Bad case: rclcpp::Node is leaked :(
void InitAndLeakNode() {
  RclcppInit();
  new rclcpp::Node("node");
}
```
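(`RclcppInit()` is a small helper in the repro repo; as a sketch, assuming it only guards one-time global init, it would look roughly like this -- the actual definition lives in the linked repo.)

```cpp
// Hypothetical sketch of the RclcppInit() helper above; assumes it
// only performs one-time global init with no command-line arguments.
#include <rclcpp/rclcpp.hpp>

void RclcppInit() {
  if (!rclcpp::ok()) {
    rclcpp::init(0, nullptr);
  }
}
```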
This shows three cases, running 200 trials, 50 in parallel (a sketch of the `dlopen()` driver follows this list):

- `direct_leak`: Link the shared lib directly, call `InitAndLeakNode()`. No segfaults.
- `dlopen_noleak`: `dlopen()` the shared lib, call `InitNode()`. No segfaults.
- `dlopen_leak`: `dlopen()` the lib, call `InitAndLeakNode()`. Many segfaults.
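For reference, a hypothetical sketch of what the `dlopen()` driver (`ros_segfault_min_dlopen`) does; the real source is in the repo, and this sketch assumes the entry points are exported with C linkage:

```cpp
// Hypothetical sketch of the dlopen() driver. Usage:
//   ./driver <lib.so> <function_name>
#include <dlfcn.h>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::fprintf(stderr, "usage: %s <lib.so> <func>\n", argv[0]);
    return 1;
  }
  // Load the shared library that links against rclcpp.
  void* handle = dlopen(argv[1], RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen: %s\n", dlerror());
    return 1;
  }
  // Look up InitNode / InitAndLeakNode by (unmangled) name.
  using InitFunc = void (*)();
  auto init = reinterpret_cast<InitFunc>(dlsym(handle, argv[2]));
  if (init == nullptr) {
    std::fprintf(stderr, "dlsym: %s\n", dlerror());
    return 1;
  }
  init();
  // Unload. With a leaked rclcpp::Node, background threads may still
  // reference code or data in the mapping being torn down.
  dlclose(handle);
  return 0;
}
```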
The ideal outcome is that `dlclose()` does not trigger segfaults even when `rclcpp::Node` instances are (poorly) leaked.
Output
Errors with `dlopen_leak` are non-deterministic:

```sh
# Non-standard build, but just want to focus on issue.
$ source /opt/ros/humble/setup.bash
$ cmake -S . -B build
$ cmake --build build -j
$ python ./src/ros_segfault_test.py
[ direct_leak ]
num_fail: 0 / 200
returncodes: {0}
[ dlopen_noleak ]
num_fail: 0 / 200
returncodes: {0}
[ dlopen_leak ]
num_fail: 185 / 200
returncodes: {0, -11, 127}
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_avl_swap_node
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_ehh_new
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsi_update_proxy_writer
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsi_update_proxy_reader
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: plist_fini_generic
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_avl_free_arg
```

(A returncode of `-11` means the child was killed by SIGSEGV; `127` corresponds to the dynamic-loader symbol lookup errors above.)
Attaching GDB
```sh
# Terminal 1: Run background stuff to incite segfault.
$ python ./src/ros_segfault_test.py --modes dlopen_leak --count 0

# Terminal 2: Run with gdb.
$ gdb --args ./build/ros_segfault_min_dlopen ./build/libros_segfault_min_lib.so InitAndLeakNode
# show backtrace on all threads
(gdb) thread apply all bt
```
Example stacktrace from the main thread; note that the `dlclose()` is being run from an exit handler installed by `librmw_implementation.so`:
```
Thread 1 (Thread 0x7ffff7f8c740 (LWP 207729) "ros_segfault_mi"):
#0  __GI__dl_debug_state () at ./elf/dl-debug.c:117
#1  0x00007ffff7fc5c24 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:769
#2  0x00007ffff7fc62a2 in _dl_close_worker (force=false, map=0x555555579330) at ./elf/dl-close.c:150
#3  _dl_close (_map=0x555555579330) at ./elf/dl-close.c:818
#4  0x00007ffff7d74a98 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd090, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#5  0x00007ffff7d74b63 in __GI__dl_catch_error (objname=0x7fffffffd0e8, errstring=0x7fffffffd0f0, mallocedp=0x7fffffffd0e7, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#6  0x00007ffff7c9012e in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#7  0x00007ffff7c8fe58 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#8  0x00007ffff7fa021d in rcutils_unload_shared_library () from /opt/ros/humble/lib/librcutils.so
#9  0x00007ffff7ea9370 in rcpputils::SharedLibrary::~SharedLibrary() () from /opt/ros/humble/lib/librcpputils.so
#10 0x00007ffff7ed0a66 in ?? () from /opt/ros/humble/lib/librmw_implementation.so
#11 0x00007ffff7c45495 in __run_exit_handlers (status=0, listp=0x7ffff7e1a838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#12 0x00007ffff7c45610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#13 0x00007ffff7c29d97 in __libc_start_call_main (main=main@entry=0x5555555551a9 <main>, argc=argc@entry=3, argv=argv@entry=0x7fffffffd748) at ../sysdeps/nptl/libc_start_call_main.h:74
#14 0x00007ffff7c29e40 in __libc_start_main_impl (main=0x5555555551a9 <main>, argc=3, argv=0x7fffffffd748, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd738) at ../csu/libc-start.c:392
#15 0x00005555555550e5 in _start ()
```
Thanks for the detailed analysis and reproduction.
My initial hunch would be that the Node is holding a function pointer to something that is getting unloaded before the library in question.
Have you tried running this under `asan` or `valgrind` to see if there are more detailed stack traces?
I ran it under `asan`, `ubsan`, and `valgrind`, but didn't see anything obvious leap out:
EricCousineau-TRI/rclcpp_dlclose_segfault@425f602
Perhaps `rcl_node_init` -> `pthread_create` is something that could have a function-pointer or some other data race condition when `dlclose()` is triggered, like you said? If so, the failure mode can be sketched without rclcpp at all; see below.
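A hypothetical, rclcpp-free sketch of that suspected failure mode (all names invented for illustration): a library spawns a never-joined thread from an init function, the main program `dlclose()`s the library, and the thread later executes code from the unmapped pages.

```cpp
// lib.cc, built as libleakythread.so (hypothetical; error handling
// omitted for brevity):
#include <pthread.h>
#include <unistd.h>

static void* Worker(void*) {
  sleep(1);        // Still asleep when dlclose() unmaps this library...
  return nullptr;  // ...so waking up here faults on freed pages.
}

extern "C" void InitAndLeakThread() {
  pthread_t tid;
  pthread_create(&tid, nullptr, &Worker, nullptr);  // Never joined.
}

// main.cc (hypothetical driver):
#include <dlfcn.h>

int main() {
  void* handle = dlopen("./libleakythread.so", RTLD_NOW);
  auto init = reinterpret_cast<void (*)()>(dlsym(handle, "InitAndLeakThread"));
  init();
  dlclose(handle);  // Races with Worker; the process likely segfaults.
  return 0;
}
```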