Use after free in `read_proc`
Closed this issue · 8 comments
In pika we have some tests that link to libapex.so
but don't actually use apex, that often trigger a segmentation fault or similar late after main has exited. The commonality in these tests is that main
typically runs very quickly. This can be reproduced with a simple program which simply contains
int main() {}
and links to apex, and rerunning it many times. I did e.g. this testing with g++ -g -O0 -lapex main.cpp
using GCC 10, but I think it should translate to any compiler.
The issue seems to be that when main
exits quickly, apex also reaches cleanup quite quickly (
apex/src/apex/apex_preload.cpp
Line 113 in 6edfb92
proc_data_reader
thread spawned in Line 66 in 6edfb92
proc_data_reader
is destroyed (Line 151 in 6edfb92
If I understand the code correctly this is because the thread is detached here:
Line 67 in 6edfb92
Line 75 in 6edfb92
detach
, avoids the issue, but I don't know if it has other effects...
Note: while the above test fails after running it enough times, I found it easy to reproduce the issue by adding a few sleeps:
diff --git a/src/apex/apex_preload.cpp b/src/apex/apex_preload.cpp
index fa3ea381..f01bb22a 100644
--- a/src/apex/apex_preload.cpp
+++ b/src/apex/apex_preload.cpp
@@ -112,6 +112,7 @@ int apex_preload_main(int argc, char** argv, char** envp) {
}
apex::cleanup();
}
+ std::this_thread::sleep_for(std::chrono::seconds(1)); // don't quit right away to allow read_proc to use memory after it's been freed
return ret;
}
diff --git a/src/apex/proc_read.cpp b/src/apex/proc_read.cpp
index c23de038..c2eb6a3a 100644
--- a/src/apex/proc_read.cpp
+++ b/src/apex/proc_read.cpp
@@ -768,6 +768,9 @@ namespace apex {
/* This is the main function for the reader thread. */
//void* proc_data_reader::read_proc(void * _ptw) {
void* proc_data_reader::read_proc() {
+ std::this_thread::sleep_for(std::chrono::seconds(0.5)); // sleep long enough that apex has shut down, but not long enough that the whole application has exited already
in_apex prevent_deadlocks;
// when tracking memory allocations, ignore these
in_apex prevent_nonsense;
Running this through valgrind, it reports with near certainty:
==105902== Thread 2:
==105902== Invalid read of size 1
==105902== at 0x4B5CED7: load (atomic_base.h:426)
==105902== by 0x4B5CED7: std::atomic<bool>::operator bool() const (atomic:87)
==105902== by 0x4C386BD: apex::proc_data_reader::read_proc() (proc_read.cpp:795)
==105902== by 0x4B7318D: void* std::__invoke_impl<void*, void* (apex::proc_data_reader::*)(), apex::proc_data_reader*>(std::__invoke_memfun_deref, voi
d* (apex::proc_data_reader::*&&)(), apex::proc_data_reader*&&) (invoke.h:73)
==105902== by 0x4B730D1: std::__invoke_result<void* (apex::proc_data_reader::*)(), apex::proc_data_reader*>::type std::__invoke<void* (apex::proc_data
_reader::*)(), apex::proc_data_reader*>(void* (apex::proc_data_reader::*&&)(), apex::proc_data_reader*&&) (invoke.h:95)
==105902== by 0x4B73042: void* std::thread::_Invoker<std::tuple<void* (apex::proc_data_reader::*)(), apex::proc_data_reader*> >::_M_invoke<0ul, 1ul>(s
td::_Index_tuple<0ul, 1ul>) (thread:264)
==105902== by 0x4B72FA1: std::thread::_Invoker<std::tuple<void* (apex::proc_data_reader::*)(), apex::proc_data_reader*> >::operator()() (thread:271)
==105902== by 0x4B72F05: std::thread::_State_impl<std::thread::_Invoker<std::tuple<void* (apex::proc_data_reader::*)(), apex::proc_data_reader*> > >::
_M_run() (thread:215)
==105902== by 0x4E2249F: execute_native_thread_routine (thread.cc:80)
==105902== by 0x50C1AC2: start_thread (pthread_create.c:442)
==105902== by 0x5152A03: clone (clone.S:100)
==105902== Address 0x52a2bf0 is 0 bytes inside a block of size 104 free'd
==105902== at 0x484DB6F: operator delete(void*, unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==105902== by 0x4B550A2: apex::apex::~apex() (apex.cpp:151)
==105902== by 0x4B5AA9B: apex::cleanup() (apex.cpp:1873)
==105902== by 0x4B51681: apex_preload_main (apex_preload.cpp:113)
==105902== by 0x5056D8F: (below main) (libc_start_call_main.h:58)
==105902== Block was alloc'd at
==105902== at 0x484B013: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==105902== by 0x4B55D12: apex::init(char const*, unsigned long, unsigned long) (apex.cpp:531)
==105902== by 0x4B51555: apex_preload_main (apex_preload.cpp:103)
==105902== by 0x5056D8F: (below main) (libc_start_call_main.h:58)
==105902==
This was tested on apex develop (6edfb92).
I see a few other std::thread::detach
es in the codebase that may have the same issue, but I did not investigate those.
@msimberg thanks for the bug report... this problem has been there for a long time, and I am not quite sure how best to resolve it. It has not been a priority since it only happens in very short programs, as you mentioned. The reason I am detaching the thread is that in this very same case, the process sometimes hangs forever trying to join that thread. Detaching the thread and accepting the occasional, though rare, crash for very short programs was a compromise. There is a race condition in these very short programs in which the main thread is exiting before the worker thread even has a chance to start. If you have a good suggestion to fix it, that would be great... it's possible that I could set up a signal between the two threads, so the main thread won't continue until the worker thread indicates that it is ready to go...?
Thanks @khuck for the response! I see... Do you have some idea of where the thread is when it hangs? I tried to comment out the detach
, but was not able to reproduce a hang (that doesn't mean it can't happen... I might just not have the right environment/configuration options etc.).
it's possible that I could set up a signal between the two threads, so the main thread won't continue until the worker thread indicates that it is ready to go
Naively I'd say the only place where it's be safe for the main thread to continue is after std::thread::join
returns, but are you saying there might be an earlier point in read_proc
where it's safe to destroy the this
/proc_data_reader
? The reading
and done
member variables are at least accessed in the while-loop in read_proc
so I guess the earliest it could signal the main thread to continue is after the while-loop (at least without refactoring).
it was actually easier to fix than I thought... the main thread doesn't continue until the worker thread reports that it is ready. See a287e94, although it the change is mixed in with some unrelated changes.
Thanks for having a look. That does improve the situation, but I think it just delays the inevitable. It's not so much a "startup race", but a "shutdown race". Running it now through valgrind gives (e.g.):
==23277== Thread 2:
==23277== Invalid read of size 4
==23277== at 0x741FDD0: __pthread_mutex_cond_lock (in /lib64/libpthread-2.26.so)
==23277== by 0x7421ACF: pthread_cond_timedwait@@GLIBC_2.3.2 (in /lib64/libpthread-2.26.so)
==23277== by 0x4F63CBB: apex::proc_data_reader::read_proc() (in /apps/daint/UES/simbergm/spack/opt/spack/cray-cnl7-haswell/gcc-11.2.0/apex-develop-qzedq7laeqmuxondkm5pdh6lkue2ades/lib/libapex.so)
==23277== by 0x56FEF43: execute_native_thread_routine (in /apps/daint/UES/simbergm/spack/opt/spack/cray-cnl7-haswell/gcc-11.2.0/gcc-runtime-11.2.0-7k5jhwmoukhcqoghpw2w3ax5dg4kxp3g/lib/libstdc++.so.6)
==23277== by 0x741B538: start_thread (in /lib64/libpthread-2.26.so)
==23277== by 0x6082E0E: clone (in /lib64/libc-2.26.so)
==23277== Address 0x80587d0 is 80 bytes inside a block of size 112 free'd
==23277== at 0x4C2FBEB: operator delete(void*, unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23277== by 0x4EAAB58: apex::apex::~apex() (in /apps/daint/UES/simbergm/spack/opt/spack/cray-cnl7-haswell/gcc-11.2.0/apex-develop-qzedq7laeqmuxondkm5pdh6lkue2ades/lib/libapex.so)
==23277== by 0x4EB280F: apex::cleanup() (in /apps/daint/UES/simbergm/spack/opt/spack/cray-cnl7-haswell/gcc-11.2.0/apex-develop-qzedq7laeqmuxondkm5pdh6lkue2ades/lib/libapex.so)
==23277== by 0x4EA7C7F: apex_preload_main (in /apps/daint/UES/simbergm/spack/opt/spack/cray-cnl7-haswell/gcc-11.2.0/apex-develop-qzedq7laeqmuxondkm5pdh6lkue2ades/lib/libapex.so)
==23277== by 0x5FAB3E9: (below main) (in /lib64/libc-2.26.so)
==23277== Block was alloc'd at
==23277== at 0x4C2E94F: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==23277== by 0x4EB03E9: apex::init(char const*, unsigned long, unsigned long) (in /apps/daint/UES/simbergm/spack/opt/spack/cray-cnl7-haswell/gcc-11.2.0/apex-develop-qzedq7laeqmuxondkm5pdh6lkue2ades/lib/libapex.so)
==23277== by 0x4EA7AF6: apex_preload_main (in /apps/daint/UES/simbergm/spack/opt/spack/cray-cnl7-haswell/gcc-11.2.0/apex-develop-qzedq7laeqmuxondkm5pdh6lkue2ades/lib/libapex.so)
==23277== by 0x5FAB3E9: (below main) (in /lib64/libc-2.26.so)
where I'm guessing pthread_cond_timedwait
corresponds to
Line 833 in a287e94
proc_data_reader
instance that likely already has been destroyed.I removed the detach, this should take care of it. The main thread will now wait to join the worker thread, and it will be joinable. See cde877d
@khuck given that the fix is now on develop
, I'm wondering if you already have a plan for when the next release of apex might happen?
Good point - there have been a couple of bug fixes since the release, so I will try to do a point release this week.