adamcharnock/lightbus

Worker thread refactoring

Closed this issue · 7 comments

Here I will discuss the refactoring of the worker thread system used by Lightbus. I want to refactor this because:

  • It has specific cases in which it can fail. I wish to avoid these cases. Specifically WorkerNotReady and WorkerDeadlock errors.
  • I want to remove the ThreadSerializedTask as this constricts asyncio (which thread safety will allow us to do)
  • I want the Lightbus client to be thread safe

Requirements

Any solution must provide the following:

  • Thread-safety
  • Prevention of deadlock cases
  • Minimise other error cases
  • Allow user callables to start their own event loops
  • Does not complect different system components

How to achieve this

Thread-safety

Thread safety requires that access to global resources is carefully controlled to prevent race conditions. This either requires the removal of global state or – where this is not possible – marshalling of access to it. For example:

  • Network connections (e.g Redis connections) – This is global state, and only one thread should be able to read from and write to the connection at any one time. We cannot remove this global state.
  • Internal bus client state (e.g. listeners, schema).

Thread safety can be provided though mechanisms such as locking or queues.

Prevention of deadlock cases

Careful system design is the first step here. If A requires B and B requires A, then no amount of locking primitives will prevent that deadlocking. Rather, removing of blocking (in the threading sense) requirements is an important step.

Queues will likely be the preferable coordination mechanism as they can help keep different system components distinct (not complected).

Minimise other error cases

Hopefully this should fall out of a good design

Does not complect different system components

Again, thoughtful systems design, and isolation of system components.

How do we deal with exceptions in sub-tasks?

So far we've been passing around a BusClient instance and calling the shutdown_server method upon error. This then puts a flag in the BusClients _server_shutdown_queue. The handler for this stops the loop, which then does the cleanup.

This involves passing the busclient everywhere, which is aweful.

In the test implementation we now have the command/invoker pattern. This passes commands in a single direction. Options:

  • Invoker can also pass errors in the opposite direction
  • Invoker can un-gradefully handle errors on the grounds that they should be handled outside the invoker. For example, a transport should catch an error and pass it to the ClientInvoker, rather than expecting the TrasportInvoker to work in reverse.

This is ongoing and I’m working on it over the holiday period. The refactoring is largely done and I’m working on getting the tests sorted, plus a number of ‘TODO’s I’ve left in my wake.

Tests now passing. TODOs next

@apollo13, as we discussed on Discord, I'm going to add the on_error parameter back in as an argument to .listen(). Error's in handled RPCs should always be logged rather than cause a processes exit.

TODOs added in commit 9d7f2c5.

@apollo13 I've just implemented a first pass at the on_error parameter (5b0fc13, 89c57df, c8cabbd)

What I didn't say earlier is: I actually think it makes much more sense for this to be a parameter to the listener, rather than a config option. How each service wants to handle errors is a choice for that service. There is no reason that every service should treat event-handling errors the same for a given API. Thanks for the good suggestion 👍

I think this is now largely complete. I'll likely merge an do an 0.10.0 alpha1 release soon.

Merged in commit a8c106d