Gabriella439/foldl

Memory blowing up with concurrent folds and streaming

mpdairy opened this issue · 5 comments

I'm using streaming-0.1.4.5 and foldl-1.4.3, and I'm running three FoldMs over a streaming Stream. When I combine them using (,,) <$> fold1 <*> fold2 <*> fold3 and run the fold, memory blows up to about 8 GB and then usually crashes my computer. However, if I run each fold separately in sequence, memory usage never exceeds 1.4 GB during any of the three folds. It seems like, at worst, the combined fold should use approximately 3 × 1.4 GB. Do you know why it blows up to so much more than that?
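For concreteness, the combined fold is run roughly like this (a minimal sketch with stand-in folds over Int; the real folds consume my MessageEvent type):

import qualified Control.Foldl as L
import qualified Streaming.Prelude as S

-- Stand-in folds; the real ones are FoldM IO MessageEvent folds
-- with much larger accumulators.
fold1, fold2 :: L.FoldM IO Int Int
fold1 = L.generalize L.sum
fold2 = L.generalize L.length

fold3 :: L.FoldM IO Int (Maybe Int)
fold3 = L.generalize L.maximum

main :: IO ()
main = do
  -- One pass over the stream, feeding every element to all three folds.
  result <- L.impurely S.foldM_
              ((,,) <$> fold1 <*> fold2 <*> fold3)
              (S.each [1 .. 1000000 :: Int])
  print result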

I also tried to manually combine the three folds instead of using FoldM's applicative instance, and the memory blew up just the same.

@mpdairy: The first thing that stands out to me is that each of your FoldMs individually uses 1.4 GB of memory. Could you provide some more context about what the accumulator for each FoldM looks like and why it might grow so large even when run in isolation?

The only thing I can think of to explain what you observe is that the accumulator of a composite FoldM is going to use a strict Pair constructor. That strict Pair may be forcing a value that was previously a smaller thunk when the FoldM was run in isolation.

For most of the out-of-the-box folds provided by the library, the thunks are larger than the fully strict value: the fully strict accumulator is expected to be tiny (i.e. < 100 bytes), whereas a thunk can grow without bound if never forced, so being strict is usually the right tradeoff. However, the accumulator for each of your FoldMs is unexpectedly large, and the unevaluated thunk of that large value might actually be a more compact representation than the fully strict form. That's why it would help to better understand what the accumulator for each individual FoldM looks like.
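To illustrate, here is roughly what that composition looks like (a paraphrase of the foldl internals, not the verbatim source):

{-# LANGUAGE ExistentialQuantification #-}

-- Simplified from Control.Foldl: the accumulator type x is hidden.
data FoldM m a b = forall x. FoldM (x -> a -> m x) (m x) (x -> m b)

-- Strict in both fields, so pattern-matching a composite accumulator
-- at each step forces both halves to weak head normal form.
data Pair a b = Pair !a !b

instance Monad m => Functor (FoldM m a) where
  fmap f (FoldM step begin done) = FoldM step begin (fmap f . done)

instance Monad m => Applicative (FoldM m a) where
  pure b = FoldM (\() _ -> pure ()) (pure ()) (\() -> pure b)

  FoldM stepL beginL doneL <*> FoldM stepR beginR doneR =
    FoldM step begin done
    where
      begin = Pair <$> beginL <*> beginR
      step (Pair xL xR) a = Pair <$> stepL xL a <*> stepR xR a
      done (Pair xL xR) = doneL xL <*> doneR xR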

@Gabriel439 Thanks for your reply. The stream comes from a Postgres database of about 800,000 JSON messages, which are decoded into a Stream of a custom MessageEvent type. To be sure it wasn't just a streaming problem, I later did a regular query instead, loading all 800k messages into a list. That uses a bit more memory right after the query (1.9 GB), but it stays at 1.9 GB for the rest of the run when I do the three separate FoldMs, and it still explodes when I combine the three FoldMs into one.

The first fold creates a Corpus (http://hackage.haskell.org/package/chatter-0.9.1.0/docs/NLP-Types.html#t:Corpus), which is just an Int and a lazy map of word counts; the word-counting part is roughly the shape sketched below. The second fold opens a temporary text file, appends the text of each message to the file at every step, and at the end runs Google's word-vector program on the file and then deletes it. The third fold looks at the part-of-speech tags (which are part of MessageEvent) and saves the MessageIds of the messages that contain questions.
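A simplified stand-in for the word-counting part (counting Text tokens rather than building the full Corpus):

import qualified Control.Foldl as L
import qualified Data.Map as Map  -- the lazy Map from containers
import Data.Text (Text)

-- Word counts accumulate in a lazy Map: insertWith (+) leaves the
-- per-word counts as unevaluated (+) thunks.
wordCounts :: L.Fold Text (Map.Map Text Int)
wordCounts = L.Fold (\m w -> Map.insertWith (+) w 1 m) Map.empty id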

So the first two shouldn't use increasing amounts of memory as they go along, and the third should be minimal. (In fact, even if I just pass along an empty list for the fold's state instead of building it up at each step, the memory still explodes.) And, like I said, when each is run separately the memory stays constant (at least in the non-streaming version of my test); they only explode when they are combined into one FoldM.

Also, I tried both the applicative version ((,,) <$> fold1 <*> fold2 <*> fold3) and my own manually combined fold with a simple record data type:

data MegaFoldState = MegaFoldState
  { megaFoldQuestions   :: [MessageId]
  , megaFoldCorpus      :: Corpus
  , megaFoldWordVectors :: (FilePath, Handle)
  } deriving (Show, Eq)

I tried making all the fields lazy and then I tried all strict, and in both cases the memory exploded.
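(For the strict variant I just added bangs to the fields, like this; as far as I understand, a strict field is only forced to weak head normal form, so anything lazy deeper inside Corpus survives either way:)

-- Same record with strict fields (MessageId is from my app,
-- Corpus is chatter's type):
data MegaFoldState = MegaFoldState
  { megaFoldQuestions   :: ![MessageId]
  , megaFoldCorpus      :: !Corpus
  , megaFoldWordVectors :: !(FilePath, Handle)
  } deriving (Show, Eq)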

Anyway, considering that it happens with my manual fold, and that it happens both with Foldl.foldM and with Streaming's foldM_, I don't think it's a problem with your library, so you can probably close this issue. But if you have any more ideas I would really appreciate them!

@Gabriel439 I found the cause of the ballooning memory: it was the lazy map inside Corpus. I forced it to evaluate at every step of the fold (using the Map.member function, for now), and now the memory used by the whole program stays under 1 GB. I wonder if you could somehow enforce deep strictness for the fold state in your library?

@mpdairy: You don't need to enforce deep strictness in the foldl library. All you have to do is set up your accumulator so that evaluating it to weak head normal form is the same as evaluating it to normal form. You can do this by forcing your accumulator value with Control.DeepSeq.rnf after every step.
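Applied to a hand-written word-count fold, that looks something like this sketch (using deepseq's force, which is rnf followed by returning the value; wordCountsDeep is a hypothetical name):

import Control.DeepSeq (force)
import qualified Control.Foldl as L
import qualified Data.Map as Map
import Data.Text (Text)

-- Fully evaluate the new accumulator before returning it, so no
-- insertWith thunks can pile up between steps. Note that this
-- re-traverses every key and value of the map on every step.
wordCountsDeep :: L.FoldM IO Text (Map.Map Text Int)
wordCountsDeep = L.FoldM step (pure Map.empty) pure
  where
    step m w = return $! force (Map.insertWith (+) w 1 m)

An alternative that avoids re-traversing the whole map is to build the accumulator with Data.Map.Strict.insertWith, which forces each value to weak head normal form as it is inserted.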

@Gabriel439 Yeah, I just discovered deepseq, but for some reason running rnf on the lazy map at every iteration is incredibly slow compared to doing if Map.null mymap then go else go, even though both fix the memory explosion (which is weird, because I'd think null would still leave a bunch of lazy stuff in the values and keys).
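For reference, the two variants I compared look roughly like this (stepRnf and stepNull are hypothetical names for the word-count step):

import Control.DeepSeq (rnf)
import qualified Data.Map as Map
import Data.Text (Text)

-- Re-traverses every key and value of the map on each step, so the
-- total work is quadratic in the size of the map.
stepRnf :: Map.Map Text Int -> Text -> IO (Map.Map Text Int)
stepRnf m w =
  let m' = Map.insertWith (+) w 1 m
  in  rnf m' `seq` return m'

-- Only forces the map to weak head normal form, which evaluates the
-- single pending insertWith thunk (O(log n)) instead of the whole map.
-- Data.Map's constructors are strict in keys and subtrees, so this does
-- force the spine and the keys; only the values stay lazy.
stepNull :: Map.Map Text Int -> Text -> IO (Map.Map Text Int)
stepNull m w =
  let m' = Map.insertWith (+) w 1 m
  in  if Map.null m' then return m' else return m'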