snapframework/snap

Is it possible that /admin/reload would make the server leak memory?

Closed this issue · 5 comments

Hello folks,

Sorry if this is no longer relevant for the upcoming 1.0 release, but I thought it would be useful to bring it up nevertheless.

I have a web server that essentially restarts itself if certain conditions are met; in particular, I have a main snaplet that holds the state of a connection to a RabbitMQ broker. If that connection is lost, for example due to a network failure, I use /admin/reload to reload the state of my snaplet back into memory.

As part of a catastrophic-failure simulation, the reload function was executed by the server quite frantically until the broker came back online.
Monitoring the server's memory, though, I noticed a huge allocation. Profiling reveals the following:

[profiling screenshot, 2015-02-09 at 15:47:06]
https://cloud.githubusercontent.com/assets/442035/6108771/0fa0d9be-b076-11e4-94b9-15ce0b03cf39.png

[profiling screenshot, 2015-02-09 at 15:52:56]
https://cloud.githubusercontent.com/assets/442035/6108778/17d5f31c-b076-11e4-80a8-0e0119b9b00e.png

[profiling screenshot, 2015-02-09 at 15:50:56]
https://cloud.githubusercontent.com/assets/442035/6108782/1fcc5bba-b076-11e4-8247-c0e87a921f9e.png

I couldn't actually profile with -hr as I get this GHC error:

hermes-server: internal error: Invalid object in isRetainer(): 39
    (GHC version 7.8.3 for x86_64_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
[1]    19712 abort      hermes-server -p 9939 +RTS -hr -L120

As far as I can see, one possible suspect is the Chunk data structure, which may be coming from text:

http://hackage.haskell.org/package/text-1.0.0.0/docs/Data-Text-Lazy-Internal.html

Browsing through the Snap code, I can see that the reload function does nothing but call an IO action that yields an Either Text Text:

, _reloader :: IO (Either Text Text) -- might change

In particular, I can see that you are forcing the result of this IO action in the reload handler:

!res <- liftIO $ _reloader cfg

But despite that, since Either is a lazy data structure, the underlying Text contained within will not be forced, and this might explain the leak.
I'm not fully convinced by this lead, though, as ByteString also has a Chunk type, and the ARR_WORDS might be telling us exactly that.
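To illustrate the WHNF point: a bang pattern on the result only evaluates the outermost constructor (Left/Right), leaving the Text inside as a thunk; forcing it deeply needs something like deepseq. This is a minimal sketch, not the actual Snap code — `reloader` is a hypothetical stand-in for the `_reloader` field:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import qualified Data.Text as T

-- Hypothetical stand-in for Snap's _reloader action.
reloader :: IO (Either T.Text T.Text)
reloader = return (Right (T.replicate 3 (T.pack "ok")))

main :: IO ()
main = do
  !res <- reloader              -- forces only the Either constructor (WHNF)
  res' <- evaluate (force res)  -- additionally forces the Text payload (NFData)
  print res'                    -- Right "okokok"
```

If the handler only forces to WHNF, whatever computation produced the Text stays alive as a thunk until something demands the characters.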

Do you guys think:

  1. This is a bug in my application code;
  2. It's a potential bug in Snap;
  3. It's not a bug but a "feature": there is no way to protect against this happening.

Thank you in advance!

Alfredo

Hmm, since Text is such a commonly used structure, it's hard to say what
might be causing this problem. It's entirely possible that this leak could
be a problem with snap's reloading code, but it's hard to know for sure. I
think there are two ways you could approach this.

  1. Try to fix the problem yourself, but be aware that you might need to fix
    something in Snap.
  2. Build us a minimal demonstration of the problem that we can use to debug
    Snap.

It's hard to say which would be easier for you.


Hi Doug, thanks for the prompt reply.

I think that, considering the delicate transition the project is facing (0.x -> 1.0), you and Greg are probably in the best position to troubleshoot what's happening, so I would be inclined to go with option two.

I will try to whip up a minimal example reproducing the bug later this week.
Thanks :)

Alfredo

I'm betting this has to do with some snaplet (either heist or something in your user code) doing a lazy modifyMVar instead of a strict one; that would mean that until the contents of the MVar are demanded, the MVar would contain a thunk chain. Similar caution applies to e.g. IORef.

Are you using the heist cache tag anywhere?
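The lazy-vs-strict MVar distinction can be sketched like this (illustrative names, not the actual amqp or Snap code): base's `modifyMVar_` is lazy by default, so the strictness has to come from the update function itself.

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar_)
import Control.Exception (evaluate)

-- Lazy: each call puts an unevaluated (+1) thunk back into the MVar,
-- building a chain that only collapses when someone demands the value.
bumpLazy :: MVar Int -> IO ()
bumpLazy mv = modifyMVar_ mv (\n -> return (n + 1))

-- Strict: evaluate the new value before it goes back into the MVar,
-- so no thunk chain can accumulate between reads.
bumpStrict :: MVar Int -> IO ()
bumpStrict mv = modifyMVar_ mv (\n -> evaluate (n + 1))
```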

Hello Greg,
I'm not using the heist cache anywhere, but my snaplet does indeed use a lot of IORefs, given the mutable nature of the RabbitMQ broker connection. I was always very cautious, though, and I use atomicModifyIORef' everywhere I need to modify the contents of a ref.
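For reference, this is the pattern in question — the primed variant from Data.IORef forces the new value before publishing it, so repeated updates cannot build a thunk chain inside the ref (a standalone sketch, not my snaplet's code):

```haskell
import Data.IORef (newIORef, readIORef, atomicModifyIORef')

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  -- atomicModifyIORef' evaluates n + 1 before it is written back;
  -- the unprimed atomicModifyIORef would leave 1000 stacked thunks.
  mapM_ (\_ -> atomicModifyIORef' ref (\n -> (n + 1, ()))) [1 .. 1000 :: Int]
  readIORef ref >>= print  -- 1000
```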

However, I can see that the underlying RabbitMQ library is using modifyMVar in a couple of places:

https://github.com/hreinhardt/amqp/search?utf8=%E2%9C%93&q=modifyMVar

So indeed, that may be the source of the leak. During this week, I will do two things:

  1. Come up with an experimental patch for the amqp library that uses the strict versions of the MVar operations, and then profile again;
  2. Reproduce a minimal example of snaplet reloading, but this time without any state, to see whether the leak comes from the server rather than from the application code.

That should give us a better picture of what's going on.

Thanks guys!
Alfredo

I think we can close the case.

After managing to run the profiling with the -hr option enabled (the crash seems to be an already-reported GHC bug, which can be partially worked around by removing the -N threading option when running the program), and after adding a few cost centre annotations, the culprit was evident:

[profiling screenshot, 2015-02-10 at 13:06:35]

(It is well known that http-client's Manager is a pretty heavy object to create, and I'm fairly sure my code was not strict enough, causing more and more managers to be created on every reload.)
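The usual fix, assuming the standard http-client API (this sketch is not my actual code), is to allocate the Manager once at startup and share it across reloads instead of creating a fresh one each time:

```haskell
import Network.HTTP.Client (Manager, newManager, defaultManagerSettings)

-- Create the Manager once, e.g. in snaplet initialization, and store
-- it in the snaplet state; every request and reload then reuses it,
-- instead of allocating a new connection pool each time.
mkSharedManager :: IO Manager
mkSharedManager = newManager defaultManagerSettings
```

The Manager owns a keep-alive connection pool, which is exactly why repeated creation without reuse shows up as steady heap growth.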

I am inclined to close this issue now: I still have a marginal leak in my profiling, but I do not think it is in any way related to the reloading portion of the framework.

Sorry for the noise guys!