spotify/mobius

Race condition results in RejectedExecutionException crash with default event runner

larryng opened this issue · 7 comments

There appears to be a race condition between MobiusLoop#dispatchEvent and MobiusLoop#dispose where eventDispatcher can accept a new event after it's been disposed because dispatchEvent() is not synchronized with dispose().

Sample app to reproduce the issue: https://github.com/larryng/MobiusBug

I think a possible fix may be to simply set disposed = true before disposing everything in dispose() instead of after.

cc @anawara

Thanks for reporting @larryng! It was great meeting you the other evening! We'll look into this and fix it asap

I'm trying to figure out what would be a good behaviour for this use case. When this happens - thread A calls dispose and after that, thread B calls dispatchEvent, we explicitly wanted that to lead to an exception. The only possible other behaviour is to ignore events posted after a dispose, and that is problematic because it can hide implementation errors and lead to odd behaviours, memory leaks, etc.

AFAICT, your suggestion will (usually) change the exception from RejectedExecutionException to IllegalStateException, but unless the caller is prepared to catch that, the application will still crash. Note that because the eventRunner runs on a separate thread, there is a race between dispatchEvent and dispose that can lead to a minor problem/less clear feedback/possibly crashes, depending on event runner implementation:

  1. dispatchEvent is called, and the event is successfully passed on to the event runner (but not yet executed by the runner).
  2. dispose is called, and the event runner is shut down.
  3. the ExecutorServiceWorkRunner.dispose method will log that there are unknown tasks/events that haven't been processed. We chose to simply log the size of the queue, since at that point, there's not going to be a meaningful toString implementation for the tasks in the queue.

For some runners, it's possible that step 3 might crash instead of logging a warning.

I guess this is a long-winded way of saying: if I understand the report right, I think this crash is how we designed it, and that we consider it to be an application-layer bug rather than a framework problem - and fixing it at the framework level leads to other undesirable effects, in particular less transparency about this type of error, for cases when you need to get that feedback. IMO, in this case, transparency trumps convenience. But maybe, depending on what the desired behaviour is, we could add some kind of convenience that handles the different ways that dispose and dispatchEvent can race with each other in a safe way. For instance, in places where we know it doesn't reduce transparency, we use SafeConnectable to ignore things post-disposal.

If MobiusLoop is working as intended, it could be that the race condition is in RxMobiusLoop.

If you think the issue is in my sample app, I'd love to know your thoughts, though all it does is set up a Mobius controller with a constant stream of events and repeatedly restart the controller at random intervals.

Apologies, I think I didn't read your code correctly the first time and didn't realise you were using an event source, meaning its subscription that should be disposed correctly by Mobius. Given that, I think it is a bug in the dispose logic - basically, the event source should be prevented from issuing further events to the loop being disposed before the event runner is disposed. I think it should be reproducible with a unit test instead of a whole app, so that would be the next step. I will try to find some time to reproduce it and verify that hypothesis but it may take me a few days - @anawara and @togi, if any of you plan to look at this, let me know and I'll drop it. :)

No worries -- I definitely could've made the code easier to read for y'all. Thanks for taking a look and the thorough write-ups!

togi commented

Alright so I have dug around this issue a bit now and I found two separate issues that both caused the ExecutionRejectedException to manifest.

As was suspected, both of the issues are related to the contract of dispose() methods in Mobius. The basic idea is that any ongoing work must be completed before a dispose method returns. So for example if you have a reference to a consumer, you may block in dispose() and send some extra messages to that consumer before returning. But once dispose() returns you may never use that consumer again.

The first problem had to do with shutdown ordering in MobiusLoop.dispose(). The WorkRunners that dispatch events and effects inside of MobiusLoop were disposed before the event source and effect handler, meaning if either of those would block in dispose() and emit some events, they would be using disposed and therefore invalid WorkRunners.

The second problem had to do more specifically with ExecutorServiceWorkRunner. It simply did not implement dispose() correctly and was designed to give up on waiting for ongoing work after 100ms. This meant that if something for would take more than 100ms to dispatch a message, it could end up sending that message to a disposed consumer.

togi commented

Fixed in v1.2.1