testthread intermittently crashes with a segfault during teardown
Closed this issue · 4 comments
After updating the sdl2-compat snapshot in Debian experimental from b50099f to 8f31681 (with a corresponding SDL3 update), testthread
crashed during build-time testing on several autobuilders. This seems to be intermittent, but I was able to reproduce it on x86_64. After looking at the crash and the test code, I think it might really be a bug in testthread
rather than a bug in the SDL library or sdl2-compat, but I'm not 100% sure.
The segfault (with AddressSanitizer) looks like this:
==1484593==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7f4057bffbe0 sp 0x7f4057bffbd8 T2)
==1484593==Hint: pc points to the zero page.
==1484593==The signal is caused by a READ memory access.
==1484593==Hint: address points to the zero page.
#0 0x0 (<unknown module>)
#1 0x7f405e94c3e6 in SDL_Log /home/smcv/src/sdl2-compat/src/dynapi/SDL_dynapi.c:219
#2 0x56310565acb3 in ThreadFunc /home/smcv/src/sdl2-compat/test/testthread.c:73
#3 0x7f40590e9094 in SDL_RunThread /home/smcv/src/SDL/src/thread/SDL_thread.c:323
#4 0x7f405969bd3d in RunThread /home/smcv/src/SDL/src/thread/pthread/SDL_systhread.c:69
#5 0x7f405ea5b2d5 in asan_thread_start ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234
which indicates that we're calling a null function pointer.
I think what is happening here is:
- the signal handler
killed()
setsalive = 0
to signalThreadFunc
#2
to exit - The two threads race:
- the signal handler in the main thread calls
quit(0)
which tears down internal library state withSDL_Quit()
; - the
ThreadFunc
callsSDL_Log
andSDL_Delay
- the signal handler in the main thread calls
- If the
SDL_Log
andSDL_Delay
function pointers inSDL_dynapi
are still non-null when they are called because the main thread has not yet reachedSDL_Quit()
, then there is no problem - but if the signal handler in the main thread wins the race and calls
SDL_Quit()
, then theSDL_Log
function pointer (or some other function pointer that is called internally in the dynapi code?) gets set to null, then theThreadFunc
crashes when it tries to call that function
The solution would be to synchronize by waiting for the ThreadFunc
to exit (call SDL_WaitThread()
) before SDL_Quit()
, and ideally document SDL_Quit()
as invalid to call while any other thread might still be calling into SDL APIs.
Or should SDL_Quit()
be keeping the function pointers minimally valid until the process exits?
I'm not completely clear on what this test intends to be testing, and it was already mostly there at the beginning of recorded history (libsdl-org/SDL-1.2@359adb41 in 2001 didn't call SDL_Log
, but did call SDL_Delay
); so I don't know whether adding a SDL_WaitThread()
is a valid fix, or whether it's intentionally exiting without waiting for a thread to verify that it's OK to do so.
I was going to say it does, but then I realized I was looking at SDL3 code. I'll backport that fix to SDL2.
Nice, a memory leak fix that is also accidentally a data race fix :-)
And now merged to SDL2. :)