AndrasKovacs/flatparse

chainl space leak

noughtmare opened this issue · 2 comments

The definition of chainl is:

chainl :: (b -> a -> b) -> ParserT st e b -> ParserT st e a -> ParserT st e b
chainl f start elem = start >>= go where
  go b = do {!a <- elem; go $! f b a} <|> pure b

But that use of <|> means we push a new frame to the stack for every iteration and that frame can keep references to a large amount of heap data.

I ran into this issue when implementing a parser for the "1 billion row challenge" (see this discourse post). I had an implementation like this:

parseMeasurements :: Parser () (M.HashMap BS.ByteString T)
parseMeasurements = chainl (\ !m (S k v) -> M.insertWith f k (T v v v 1) m) (pure M.empty) parseLine where
  f (T a b c d) (T a' b' c' d') = T (min a a') (max b b') (c+c') (d+d')

Which caused a major space leak slowing down my program from 1 second to over 12 seconds (with only 9.2% productivity) on an input of 10 million rows.

I've fixed it by rewriting my implementation to use a manual recursion:

parseMeasurements :: Parser () (M.HashMap BS.ByteString T)
parseMeasurements = go M.empty where
  go !m = withOption parseLine (\(S k v) -> go (M.insertWith f k (T v v v 1) m)) (pure m)
  f (T a b c d) (T a' b' c' d') = T (min a a') (max b b') (c+c') (d+d')

So I'd suggest changing the definition of chainl similarly, e.g.:

chainl :: (b -> a -> b) -> ParserT st e b -> ParserT st e a -> ParserT st e b
chainl f start elem = start >>= go where
  go b = withOption elem (\ !a -> go $! f b a) (pure b)

Thanks! You're totally right, I'll fix shortly.

Fixed by 851a39c.