libsdl-org/sdl2-compat

testthread intermittently crashes with a segfault during teardown

Closed this issue · 4 comments

smcv commented

After updating the sdl2-compat snapshot in Debian experimental from b50099f to 8f31681 (with a corresponding SDL3 update), testthread crashed during build-time testing on several autobuilders. This seems to be intermittent, but I was able to reproduce it on x86_64. After looking at the crash and the test code, I think it might really be a bug in testthread rather than a bug in the SDL library or sdl2-compat, but I'm not 100% sure.

The segfault (with AddressSanitizer) looks like this:

==1484593==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 bp 0x7f4057bffbe0 sp 0x7f4057bffbd8 T2)
==1484593==Hint: pc points to the zero page.
==1484593==The signal is caused by a READ memory access.
==1484593==Hint: address points to the zero page.
    #0 0x0  (<unknown module>)
    #1 0x7f405e94c3e6 in SDL_Log /home/smcv/src/sdl2-compat/src/dynapi/SDL_dynapi.c:219
    #2 0x56310565acb3 in ThreadFunc /home/smcv/src/sdl2-compat/test/testthread.c:73
    #3 0x7f40590e9094 in SDL_RunThread /home/smcv/src/SDL/src/thread/SDL_thread.c:323
    #4 0x7f405969bd3d in RunThread /home/smcv/src/SDL/src/thread/pthread/SDL_systhread.c:69
    #5 0x7f405ea5b2d5 in asan_thread_start ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234

which indicates that we're calling a null function pointer.

I think what is happening here is:

  • the signal handler killed() sets alive = 0 to signal ThreadFunc #2 to exit
  • The two threads race:
    • the signal handler in the main thread calls quit(0) which tears down internal library state with SDL_Quit();
    • the ThreadFunc calls SDL_Log and SDL_Delay
  • If the SDL_Log and SDL_Delay function pointers in SDL_dynapi are still non-null when they are called because the main thread has not yet reached SDL_Quit(), then there is no problem
  • but if the signal handler in the main thread wins the race and calls SDL_Quit(), then the SDL_Log function pointer (or some other function pointer that is called internally in the dynapi code?) gets set to null, then the ThreadFunc crashes when it tries to call that function

The solution would be to synchronize by waiting for the ThreadFunc to exit (call SDL_WaitThread()) before SDL_Quit(), and ideally document SDL_Quit() as invalid to call while any other thread might still be calling into SDL APIs.

Or should SDL_Quit() be keeping the function pointers minimally valid until the process exits?

smcv commented

I'm not completely clear on what this test intends to be testing, and it was already mostly there at the beginning of recorded history (libsdl-org/SDL-1.2@359adb41 in 2001 didn't call SDL_Log, but did call SDL_Delay); so I don't know whether adding a SDL_WaitThread() is a valid fix, or whether it's intentionally exiting without waiting for a thread to verify that it's OK to do so.

I was going to say it does, but then I realized I was looking at SDL3 code. I'll backport that fix to SDL2.

smcv commented

Nice, a memory leak fix that is also accidentally a data race fix :-)

And now merged to SDL2. :)