uoaerg/wavemon

Not clear that lock is always taken for init_scan_list

cbiffle opened this issue · 4 comments

I've noticed wavemon segfaulting a lot, and took the time to rebuild with debug symbols and capture a coredump.

It looks like it's interacting with bogus contents in the scan_result in init_scan_list. To reproduce this, start wavemon 0.9.1 (I am on Arch, specifically, running the 0.9.1-1 package version) and spam the F1-F3-F1-F3 keys until it crashes. Usually takes less than ten seconds for me.

I noticed this line in the source:

/** Initialize scan results. Requires lock to be taken. */

Nothing about the structure of the program ensures that the lock is taken, and in particular, it looks like the code path through main into scr_aplst_init calls init_scan_list on a global variable without any synchronization. It's also not clear how fini synchronizes with the mutex management in the scan thread; I don't see any calls to pthread_setcanceltype, so the pthread_cancel will take effect at some point later, possibly leading to data races. (I notice the mutex management changed in c74e52d and some synchronization was removed; I don't totally understand the change.)

On reflection, it also looks like the scan thread can call POSIX "cancellation point" APIs while the mutex is locked, which would mean that the thread could be cancelled without unlocking the mutex. Is there something preventing this that I've missed? (I would expect this to produce deadlocks etc. and I have not seen them, so I'm simply curious about this one.)

Edit: Also, where is the scan result mutex actually initialized? The code appears to rely on "all zeroes" being a legal mutex, which happens to work on glibc (apparently!) but is not correct per the POSIX standard, which states, "If mutex does not refer to an initialized mutex object, the behavior of pthread_mutex_lock(), pthread_mutex_trylock(), and pthread_mutex_unlock() is undefined."

So ... what is the solution you are suggesting?

It took me a while to reconstruct what is going on. The simplification in c74e52d removed the use of mutexes when destroying the scan list, since the thread can be canceled with a mutex held. The change was to use a static initializer (PTHREAD_MUTEX_INITIALIZER) in init_scan_list() - and I forgot to update the comment.

Your first comment above identifies the bug that c74e52d did not fix: pthread_cancel is called without a subsequent pthread_join, hence the thread can continue to run after parts of the data structure have already been freed.

I will rework the original fix and revisit the mutex handling. Thank you for reporting this.

@cbiffle - I have made comprehensive changes following your suggestions. Would you be able to compile from master and give this a spin? The result looks correct to my eyes, and I am not able to produce a segfault on my system.