ClusterLabs/libqb

Threading issue with qb_loop_job_add()?

russel-m opened this issue · 6 comments

A little over a year ago, I had written to the developers list about a crash I was seeing occasionally from make_job_from_tmo() in a multi-threaded process. This resulted in Issue #436 being written / fixed, and we stepped up to libqb v2.0.3 (many sincere thanks!) Recently, I ran into another crash:

GDB: Reading symbols from mainpgm...done.
[New LWP 13166]
[New LWP 13178]
[New LWP 13177]
[New LWP 13173]
[New LWP 13172]
[New LWP 13179]
[New LWP 13168]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by 'mainpgm'.
Program terminated with signal 11, Segmentation fault.
#0  qb_list_length (head=<optimized out>) at ../include/qb/qblist.h:303
303     ../include/qb/qblist.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install jupiter-sac-oam-5.20.01-53494.x86_64
(gdb) #0  qb_list_length (head=<optimized out>) at ../include/qb/qblist.h:303
#1  get_more_jobs (s=<optimized out>, ms_timeout=<optimized out>)
    at loop_job.c:59
#2  0x00007fab8a49d6c1 in qb_loop_run (lp=<optimized out>) at loop.c:165
#3  0x0000000000aff58f in main ()
(gdb) 

(unfortunately, symbols had been stripped out)

The crash appeared to occur when the process called qb_loop_job_add() from a non-main-event-loop thread. We've had the call there for over a year, and we hadn't noticed a problem, but I took a look at the code in lib/loop_job.h and lib/loop_job.c, and I am suspicious that the same mutexes are needed in the job list as were added to the timer list for #436. The crash doesn't occur often, but I suspect there's a race condition there.

Firstly, apologies for taking so long to get to this, I only just noticed it today so I need to check my notifications from github are working properly.

After fixing the bug in the timer list last year I went through libqb to assess how thread-safe it really is. And the answer was: it just isn't - at all - thread safe in any meaningful way. So while I'm not ruling out fixing the issue here, I suspect there are lots of others just waiting to bite us/you if only we wait long enough. And I also worry about setting a precedent that any/all threading issues will get fixes - which would be totally impractical for such a small team (me).

As an example of the scale of the problem, the map implementations have no locking in them at all. Any attempt to use them in a multi-threaded environment is guaranteed to hit problems fairly quickly. I did look into how hard it would be to add locking to them and the answer was - they would need re-writing from scratch to make them usable AND thread-safe.

sdake commented

Hi Gang,

During the history of Corosync, the project transitioned from a single thread, to multiple threads, back to single threads. The libqb library was never intended to be thread safe. The main rationale against threads is the cost of a mutex is so astronomical that any benefit from concurrency is wiped out.

Years and years of benchmarking and technical analysis went into this decision. The final result, a single thread event loop, produced the fastest implementation of totem ever made at over 300k messages per second on Nehalem circa 2008 hardware.

Cheers
steve

sdake commented

Also I am not stating this is right for the current project, only why it is this way.

Thanks for clarifying that Steven. Corosync still runs single-threaded (though knet doesn't) so doesn't need or want a thread-safe libqb.

I think at one point, libqb was (bizarrely) advertised as thread-safe, which a cursory examination of the code reveals is just not true.

I would, however, suggest one fix that can be done right away: https://wiki.clusterlabs.org/wiki/Category:Libqb “Except for some documented anti-pattern use cases regarding IPC communication and logging, it is deemed thread-safe.” That statement seems a bit misleading now!

Thanks, I've updated the wiki page