mrkkrp/megaparsec

Space leak in reachOffset'

tomjaguarpaw opened this issue · 6 comments

The SourcePos field of St is lazy:

data St = St SourcePos ShowS

This causes a space leak in reachOffset':

St spos f = foldl'' go (St pstateSourcePos id) pre
go (St apos g) ch =
let SourcePos n l c = apos
c' = unPos c
w = unPos pstateTabWidth
in if
| ch == newlineTok ->
St
(SourcePos n (l <> pos1) pos1)
id
| ch == tabTok ->
St
(SourcePos n l (mkPos $ c' + w - ((c' - 1) `rem` w)))
(g . (fromTok ch :))
| otherwise ->
St
(SourcePos n l (c <> pos1))
(g . (fromTok ch :))

Although the St constructor remains evaluated each time around the foldl' iteration, the SourcePos inside does not. Instead it accumulates a large thunk.

Now, this doesn't affect megaparsec itself, because it only uses reachOffset' to derive reachOffsett methods and the reachOffset methods are never used within the library itself. However, a problem can arise if a client defines a TraversableStream whose reachOffsetNoLine method is derived from its reachOffset. If that reachOffset uses reachOffset' directly, or indirectly by using one of the built in TraversableStream instances, then the space leak will manifest itself.

Given that reachOffset seems no longer used by megaparsec perhaps the best long term fix is to deprecate then remove that method. In the short term perhaps you could make that field strict again.

The SourcePos field was made lazy in e307266 but I don't understand why. I can hardly imagine that allowing this space leak is good for performance.

IIRC this was a conscious decision. reachOffset is supposed to be used only when parsing is finished as part of error reporting. Its performance therefore doesn't matter as much because it will be evaluated only in "parse failure" branches of execution flow. I remember that I profiled some parsers and making that field lazy gave some noticeable performance improvement, so I made it lazy.

I'm open to either updating the documentation to mention this possible performance problem or making the field strict provided the benchmarks are not getting any worse.

It may be that my memory deceives me though. Happy to make this strict provided there is no performance regression.

the reachOffset methods are never used within the library itself

I just realised this isn't true. reachOffset is used in errorBundlePretty, so it's even more imperative to fix this issue.

Is there any circumstance in which one would call reachOffset but not evaluate the PosState it returns? If not then it makes no sense to make that field of St lazy. If it's lazy you still do exactly as much computation updating the PosState, plus additional computation creating and entering thunks, plus consume additional amounts of memory proportional to o - pstateOffset. The constant factor on the memory usage is large, tens of bytes at least, so if you are parsing a 10 MB file you can easily consume 1 GB just to report errors.

master

Uses 2.4 MB to report errors in a 10 kB input stream, an overhead of 24x! (Naturally, the size of the space leak scales linearly with the size of the input file.)

screenshot0

Strict SourcePos in St

The space leak is gone.

screenshot1

Code

https://github.com/tomjaguarpaw/megaparsec-bug

Would you like to open a PR?