piotrmurach/finite_machine

Potential memory leak in v0.11.x

craiglittle opened this issue · 20 comments

It appears there's a memory leak in the v0.11 series of the gem. When we bumped to it in our app, we started to experience severe memory issues in production.

I don't have time at the moment to do a deeper dive, but I wanted to make you aware of the issue in case you might have an inkling off the top of your head about where the problem might be. For the time being, we've had to revert back to v0.10.x until we can get to the bottom of it.

The v0.11.x is a major rewrite, the lib started to grow in duplication and became hard to maintain hence the change. There is nothing that jumps out at me as the obvious place. I will try to dig around tomorrow to do some debugging and memory profiling. I plan to write some simple state machine and loop it x amount of times on the v0.10.x and v0.11.x versions and see where that leads me. Sorry for this, I hope it can be solved easily.

I tracked this down to the fact that an instance of EventQueue is now being created whenever a state machine is defined. Each instance of EventQueue starts a background thread to process events. Since those background threads are never killed, the garbage collector can't clean them up, even when the containing state machines go out of scope.

It seems like an instance of EventQueue should be created whenever an event is triggered and shut down when the event is finished. That way, the background threads will be cleaned up each time they are finished being used.

Here's a branch with a naive tweak I made that still makes all of the specs pass (with the exception of async_events_spec) and stops the number of objects growing unbounded.

And here's the little script I used to diagnose the problem. If you comment out EventQueue.new in the StateMachine initializer, you'll see that the memory leak stops.

I hope that's helpful. I'd love to get this resolved so we can once again use the latest version of the gem!

@peter-murach Any thoughts on how much work it'd take to tackle this?

@craiglittle Sorry! I'm gonna work on it today and tomorrow, it is now my highest priority. I may throw few questions your way about some features as well.

Sounds good! Give me a shout when I can lend a hand.

Initially the EventQueue was implemented as thread global, an obvious bug. Now, I think I need to change it to only spin up threads when async events are fired. I've tried in a spike to get this done but it was ugly, so will need to take a second look.

Hey, just wanted to check in on this. I see you added a provisional fix for the issue. Any feeling on what it'll take to get it to a point where you're comfortable cutting a release?

Hi, I've worked on this yesterday but my internet went dead. I've pushed all the changes now. Basically, I've split up the event queue for events and callbacks messages(both things were mixed up together).Also, made the queues lazy load which sped things up a bit. Can you check out the master? Once happy I will release patch.

Hey, let me know when you've tested the master branch and I will cut a patch release.

@craiglittle Released v0.11.3 that keeps the memory allocation at bay, do you have time to check it out? After that I will start work on the features we have discussed in different issues for v0.12.0 release.

Sorry, I've been really swamped this week. Thanks for putting out the patch!

I'll try to check it out in the next few days and report back. Looking forward to see what you do with the other issues.

No worries, I know exactly how it feels, been there few times before.

Yes, I want to give more love to this library this year, simplify where possible and resolve all the issues.

Status update:

I tried to get this cautiously into production to prove that the issue is fixed, but one of our specs is failing with a segfault in a way that involves sleep calls. I haven't had a chance to dig into it, and frankly, I'm not really sure where to start.

So, in short, I gave it a shot, but I'm kind of blocked at the moment.

I must say this really sucks, I'm so much more determined to look into this. I was thinking about the whole threading issue. In your app, do you actually use any of the async behaviour? I'm hoping to drastically simplify synchronisation, I think things got complex for no apparent reason. I will have another take at this.

Nope, we don't use the async behavior, which makes this especially frustrating. I think you're right that that aspect of the codebase needs another look since it seems to be the primary source of the bugs we keep running into. It'd be nice if the async feature could be layered on while being minimally invasive to the core logic of the gem. The additional complexity really seems to be an issue at the moment. :/

+1. I've used this gem in multiple projects and never once used the async functionality.

I personally believe it's unnecessary complexity in this gem as well.

Hi there,

First of all, congrats on this excellent gem! Here at the company where I work at, we use finite_machine to control the different states of orders in our e-commerce.

We recently started having "ThreadError: can't create Thread: Resource temporarily unavailable" errors on production coming from our state machine (however we do not use async transitions or callbacks). We were on version 0.11.0 and upgrading to 0.11.3 so far seems to have fixed the issue.

I am not familiar with the internal workings of the gem, so please take my advice with a grain of salt, but maybe the concurrent-ruby gem can help with handling the complexity related to the async features.

Hi Alex,

Thanks for your feedback, that's really useful. I'm currently working on rewriting the threading support taking the points mentioned in this thread. Basically, things got a bit too complex. I will have to slowly unravel the code, that is, remove synchronization in lower level code and rely on higher level calls to manage complexity.

I know about the concurrent-ruby gem but I wouldn't want to use it as it is a huge dependency and I don't believe it is needed, Ruby sports enough to help us out in this situation. Plus I want to rely more on the design, e.i. immutable objects and avoiding mutation as much as possible to solve the issue.

The errors you were getting are in relation to threads holding on to references and things not being GCed. That suggests to me wrong synchronization.

Please bear with me, I will soon need help testing the master branch.

@piotrmurach Got it. Sure, just let me know if you need help testing a new release candidate =)

Released v0.12.0 that introduces few breaking changes to the API but similarly simplifies a number of things. Notable changes in relation to memory leaks:

  • introduce concurrent-ruby to handle events and callbacks thread safe access
  • removed pretty much all the internal variables synchronization and focused on synchronizing core StateMachine class methods
  • speed improvement of 2x more transitions fired per second
  • memory allocations reduced by 2x