mrkkrp/megaparsec

Unexpected input-sharing behaviour of parsers

EggBaconAndSpam opened this issue · 3 comments

I finally managed to hunt down a memory leak which arose from the following arguably surprising behaviour:

Running either of these parsers and holding on to its result will keep the entire input in memory:

import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import Data.Text.Internal.Lazy (defaultChunkSize)
import Data.Void (Void)
import Text.Megaparsec

-- Returns a one-character slice that shares its buffer with the entire input.
testSharingStrict :: Parsec Void Text Text
testSharingStrict = takeP Nothing 1

-- Keeps one character per chunk; each kept piece shares its whole 16K chunk.
testSharingLazy :: Parsec Void TL.Text TL.Text
testSharingLazy =
  fmap TL.concat . many . try $ takeP Nothing 1 <* takeP Nothing (defaultChunkSize - 1)

This is due to the sharing semantics of both strict and lazy text. (I believe the same applies to strict and lazy bytestrings.)

Having been bitten by this, I'm struggling to come up with a scenario where we actually gain anything from sharing memory with the input stream. In the strict case sharing always keeps the entire input in memory, while in the lazy case we still retain a whole chunk (of 16K) for each 'output' (which may be much smaller than that).
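For illustration, here is a minimal self-contained sketch of the underlying behaviour of strict Text (the input size and names are made up): the pieces returned by T.splitAt are zero-copy slices into the input's buffer, while T.copy allocates a fresh buffer of exactly the slice's size.

import qualified Data.Text as T

main :: IO ()
main = do
  let input = T.replicate 1000000 (T.singleton 'x') -- ~1 MB of strict Text
      (slice, _) = T.splitAt 1 input                -- zero-copy: shares input's buffer
      copied = T.copy slice                         -- fresh one-character buffer
  -- Retaining 'slice' keeps the whole ~1 MB array alive even after 'input'
  -- is dropped; retaining 'copied' keeps only one character.
  print (T.length slice, T.length copied)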

It seems to me we could get rid of the input-sharing behaviour by adding a few applications of copy :: Text -> Text in the right places:

instance Stream T.Text where
  ...
  takeN_ n s
    | n <= 0 = Just (T.empty, s)
    | T.null s = Nothing
    | otherwise = Just (first T.copy (T.splitAt n s))
  takeWhile_ f s = first T.copy (T.span f s)
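Presumably the lazy Text instance would want the analogous treatment. A rough sketch, with a hypothetical copyLazy helper that re-copies each strict chunk:

instance Stream TL.Text where
  ...
  takeN_ n s
    | n <= 0 = Just (TL.empty, s)
    | TL.null s = Nothing
    | otherwise = Just (first copyLazy (TL.splitAt (fromIntegral n) s))
  takeWhile_ f s = first copyLazy (TL.span f s)

-- Rebuild a lazy Text from freshly copied strict chunks so that nothing
-- shares memory with the original stream.
copyLazy :: TL.Text -> TL.Text
copyLazy = TL.fromChunks . map T.copy . TL.toChunks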

Why is that a bad idea, and/or what am I missing?

One way to get rid of this foot-gun:

  • add copy in the right places of instance Stream Text etc. so that it does not share by default
  • add a Shared newtype wrapper such that instance Stream (Shared Text) retains the current sharing behaviour, as an optional optimisation (with a disclaimer); a sketch follows below
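A rough sketch of the second point (the Shared name and instance body are only illustrative, and the remaining Stream methods are elided): the wrapped instance keeps today's zero-copy splitAt/span for users who explicitly opt in.

{-# LANGUAGE TypeFamilies #-}

newtype Shared s = Shared s

instance Stream (Shared Text) where
  type Token (Shared Text) = Char
  type Tokens (Shared Text) = Text
  -- Same zero-copy slicing as today, i.e. no T.copy:
  takeN_ n (Shared s)
    | n <= 0 = Just (T.empty, Shared s)
    | T.null s = Nothing
    | otherwise = Just (Shared <$> T.splitAt n s)
  takeWhile_ f (Shared s) = Shared <$> T.span f s
  ...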

What do you think?

This is a valid concern, thanks for bringing this up. I think I originally wanted combinators like takeN_ to be as fast as possible and for that reason I used splitAt; I was much more preoccupied with raw speed than with the potential danger of retaining long input streams through sharing. I'd be curious to know how adding copy affects the benchmarks. If the slowdown is not too bad, which may well be the case, we might do better by just adding copy in the right places and avoiding sharing.
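To get a rough idea of the raw cost of the extra copy in isolation, one could also run a minimal criterion sketch like the following (hypothetical and separate from the library's included benchmarks; the sizes are arbitrary):

{-# LANGUAGE OverloadedStrings #-}

import Criterion.Main
import Data.Bifunctor (first)
import qualified Data.Text as T

main :: IO ()
main = do
  let input = T.replicate 100000 "x"
  defaultMain
    [ bench "splitAt (shared)" $ nf (fst . T.splitAt 5) input
    , bench "splitAt + copy" $ nf (fst . first T.copy . T.splitAt 5) input
    ]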

Would you have time to add copy and run the benchmarks that are included with the library (before/after the change)?

Thanks for having a look at this!

I'll get you a PR / more data by the end of this week (this is mainly a deadline for myself so that I don't postpone this indefinitely).