Unexpected input-sharing behaviour of parsers
EggBaconAndSpam opened this issue · 3 comments
I finally managed to hunt down a memory leak which arose from the following arguably surprising behaviour:
Running either of these parsers and holding on to their results will keep the entire input in memory:
import Data.Void (Void)
import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Internal.Lazy as TL (defaultChunkSize)
import Text.Megaparsec

-- Holding on to the one-character result keeps the entire strict input alive.
testSharingStrict :: Parsec Void Text Text
testSharingStrict = takeP Nothing 1

-- Each retained one-character result pins the whole 16K lazy chunk it was cut from.
testSharingLazy :: Parsec Void TL.Text TL.Text
testSharingLazy =
  fmap TL.concat . many . try $
    takeP Nothing 1 <* takeP Nothing (TL.defaultChunkSize - 1)
This is due to the sharing semantics of both strict and lazy text. (I believe the same applies to strict and lazy bytestrings.)
Having been bitten by this, I'm struggling to come up with a scenario where we actually gain anything from sharing memory with the input stream. In the strict case sharing always keeps the entire input in memory, while in the lazy case we still keep a whole chunk (of 16K) alive for each 'output' (which may be much smaller than that).
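For context on the mechanism (my illustration, not code from megaparsec or text): a strict Text is a slice of a shared buffer, so splitAt allocates no new payload, and either half keeps the whole buffer alive; copy is what severs the tie.

-- Sketch of why slicing retains the input. Simplified: in text >= 2.0 a
-- strict Text is roughly 'data Text = Text ByteArray Int Int', i.e. a
-- payload array plus offset and length.
import qualified Data.Text as T

sliceDemo :: (T.Text, T.Text)
sliceDemo =
  let big = T.replicate 1000000 (T.singleton 'x') -- ~1 MB payload
      (firstChar, _) = T.splitAt 1 big
   in ( firstChar        -- a one-character view that pins the whole ~1 MB array
      , T.copy firstChar -- a fresh one-character Text with its own tiny array
      )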
It seems to me we could get rid of the input-sharing behaviour by adding a few copy :: Text -> Text calls in the right places:
instance Stream T.Text where
  ...
  takeN_ n s
    | n <= 0 = Just (T.empty, s)
    | T.null s = Nothing
    | otherwise = Just (first T.copy (T.splitAt n s)) -- copy the consumed chunk
  takeWhile_ p = first T.copy . T.span p -- 'first' from Data.Bifunctor
Why is that a bad idea, and/or what am I missing?
One way to get rid of this foot-gun:
- add copy to the right places of instance Stream Text etc. so that not sharing is the default
- add a Shared newtype wrapper s.t. instance Stream (Shared Text) retains the current sharing behaviour, as an optional optimisation (with a disclaimer); see the sketch after this list
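To make the second option concrete, here is a hypothetical sketch of what the wrapper could look like against megaparsec 9's Stream class (the Shared name and the instance body are my illustration, not an existing API):

{-# LANGUAGE TypeFamilies #-}

import qualified Data.Text as T
import Text.Megaparsec.Stream (Stream (..))

-- Hypothetical opt-in wrapper: parsers over 'Shared Text' keep today's
-- zero-copy slicing, while the plain 'Text' instance would copy.
newtype Shared a = Shared {unShared :: a}

instance Stream (Shared T.Text) where
  type Token (Shared T.Text) = Char
  type Tokens (Shared T.Text) = T.Text
  tokenToChunk _ = T.singleton
  tokensToChunk _ = T.pack
  chunkToTokens _ = T.unpack
  chunkLength _ = T.length
  take1_ (Shared s) = fmap (fmap Shared) (T.uncons s)
  takeN_ n (Shared s)
    | n <= 0 = Just (T.empty, Shared s)
    | T.null s = Nothing
    | otherwise =
        -- splitAt yields slices of the input buffer; deliberately no copy
        let (x, s') = T.splitAt n s in Just (x, Shared s')
  takeWhile_ p (Shared s) =
    let (x, s') = T.span p s in (x, Shared s')

A full integration would presumably also need VisualStream / TraversableStream instances so errors can be rendered, but the Stream methods above are where the sharing decision lives.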
What do you think?
This is a valid concern, thanks for bringing this up. I think I originally wanted combinators like takeN_ to be as fast as possible, and for that reason I used splitAt. I was much more preoccupied with raw speed than with the potential danger of sharing long input streams. I'd be curious to know how adding copy affects the benchmarks. If the slowdown is not too bad, which it could well be, we might do better by just adding copy in the right places and avoiding sharing.
Would you have time to add copy and run the benchmarks that are included with the library (before/after the change)?
Thanks for having a look at this!
I'll get you a PR / more data by the end of this week (this is mainly a deadline for myself so that I don't postpone this indefinitely).