Threading issue with qb_loop_job_add()?

Question

Threading issue with qb_loop_job_add()?

russel-m opened this issue 3 years ago · 6 comments

A little over a year ago, I had written to the developers list about a crash I was seeing occasionally from make_job_from_tmo() in a multi-threaded process. This resulted in Issue #436 being written / fixed, and we stepped up to libqb v2.0.3 (many sincere thanks!) Recently, I ran into another crash:

GDB: Reading symbols from mainpgm...done.
[New LWP 13166]
[New LWP 13178]
[New LWP 13177]
[New LWP 13173]
[New LWP 13172]
[New LWP 13179]
[New LWP 13168]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by 'mainpgm'.
Program terminated with signal 11, Segmentation fault.
#0  qb_list_length (head=<optimized out>) at ../include/qb/qblist.h:303
303     ../include/qb/qblist.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install jupiter-sac-oam-5.20.01-53494.x86_64
(gdb) #0  qb_list_length (head=<optimized out>) at ../include/qb/qblist.h:303
#1  get_more_jobs (s=<optimized out>, ms_timeout=<optimized out>)
    at loop_job.c:59
#2  0x00007fab8a49d6c1 in qb_loop_run (lp=<optimized out>) at loop.c:165
#3  0x0000000000aff58f in main ()
(gdb)

(unfortunately, symbols had been stripped out)

The crash appeared to occur when the process called qb_loop_job_add() from a non-main-event-loop thread. We've had the call there for over a year, and we hadn't noticed a problem, but I took a look at the code in lib/loop_job.h and lib/loop_job.c, and I am suspicious that the same mutexes are needed in the job list as were added to the timer list for #436. The crash doesn't occur often, but I suspect there's a race condition there.

Answer 1 · 2022-06-13T09:17:11.000Z

Firstly, apologies for taking so long to get to this, I only just noticed it today so I need to check my notifications from github are working properly.

After fixing the bug in the timer list last year I went through libqb to assess how thread-safe it really is. And the answer was: it just isn't - at all - thread safe in any meaningful way. So while I'm not ruling out fixing the issue here, I suspect there are lots of others just waiting to bite us/you if only we wait long enough. And I also worry about setting a precedent that any/all threading issues will get fixes - which would be totally impractical for such a small team (me).

As an example of the scale of the problem, the map implementations have no locking in them at all. Any attempt to use them in a multi-threaded environment is guaranteed to hit problems fairly quickly. I did look into how hard it would be to add locking to them and the answer was - they would need re-writing from scratch to make them usable AND thread-safe.

Answer 2 · 2022-06-13T09:31:26.000Z

Hi Gang,

During the history of Corosync, the project transitioned from a single thread, to multiple threads, back to single threads. The libqb library was never intended to be thread safe. The main rationale against threads is the cost of a mutex is so astronomical that any benefit from concurrency is wiped out.

Years and years of benchmarking and technical analysis went into this decision. The final result, a single thread event loop, produced the fastest implementation of totem ever made at over 300k messages per second on Nehalem circa 2008 hardware.

Cheers
steve

Answer 3 · 2022-06-13T09:32:31.000Z

Also I am not stating this is right for the current project, only why it is this way.

Answer 4 · 2022-06-13T09:38:22.000Z

Thanks for clarifying that Steven. Corosync still runs single-threaded (though knet doesn't) so doesn't need or want a thread-safe libqb.

I think at one point, libqb was (bizarrely) advertised as thread-safe, which a cursory examination of the code reveals is just not true.

Answer 5 · 2022-06-13T11:44:17.000Z

Hi Chrissy! That’s OK. The email list gets filtered by our email system for some reason, so I went down the github route instead this time. I just figured you were busy. In the one situation that was causing significant problems, I ended up just registering a signal handler via qb_loop_signal_add() and having the other thread send the process a SIGUSR2 to notify the main event loop that work needs to be done. Since the setup all occurs on the main event loop (or before entering the main event loop), and all the other thread does is kill(getpid(), SIGUSR2), there’s no race issues and so far no crashes. The shared data that describes the work is, of course, mutex protected, so no issues there. We still do have some other instances where we’re still using qb_loop_job_add() or qb_loop_timer_add() from another thread, but I’ll switch them over to the other model if a fix is not likely in the short term. It’s pretty clear that it’s just luck that is keeping us safe there. I would, however, suggest one fix that can be done right away: https://wiki.clusterlabs.org/wiki/Category:Libqb “Except for some documented anti-pattern use cases regarding IPC communication and logging, it is deemed thread-safe.” That statement seems a bit misleading now! Thanks again for looking into it! I hope sometime it is rewritten to be thread-safe, but for now, we’ll take the approach of avoiding direct interaction with libqb from any thread but the main event loop thread. Russ From: Chrissie Caulfield ***@***.***> Sent: Monday, June 13, 2022 5:17 AM To: ClusterLabs/libqb ***@***.***> Cc: Miranda, Russel (US) ***@***.***>; Author ***@***.***> Subject: EXTERNAL: Re: [ClusterLabs/libqb] Threading issue with qb_loop_job_add()? (Issue #462) Firstly, apologies for taking so long to get to this, I only just noticed it today so I need to check my notifications from github are working properly. After fixing the bug in the timer list last year I went through libqb to assess how thread-safe it really is. And the answer was: it just isn't - at all - thread safe in any meaningful way. So while I'm not ruling out fixing the issue here, I suspect there are lots of others just waiting to bite us/you if only we wait long enough. And I also worry about setting a precedent that any/all threading issues will get fixes - which would be totally impractical for such a small team (me). As an example of the scale of the problem, the map implementations have no locking in them at all. Any attempt to use them in a multi-threaded environment is guaranteed to hit problems fairly quickly. I did look into how hard it would be to add locking to them and the answer was - they would need re-writing from scratch to make them usable AND thread-safe. — Reply to this email directly, view it on GitHub<#462 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AEDUFN2E6ELMDCHN3JEJMM3VO34CFANCNFSM5RJCIKJQ>. You are receiving this because you authored the thread.Message ID: ***@***.******@***.***>>

Answer 6 · 2022-06-13T14:37:35.000Z

I would, however, suggest one fix that can be done right away: https://wiki.clusterlabs.org/wiki/Category:Libqb “Except for some documented anti-pattern use cases regarding IPC communication and logging, it is deemed thread-safe.” That statement seems a bit misleading now!

Thanks, I've updated the wiki page