Space leak in reachOffset'
tomjaguarpaw opened this issue · 6 comments
The SourcePos
field of St
is lazy:
megaparsec/Text/Megaparsec/Stream.hs
Line 358 in cd6302d
This causes a space leak in reachOffset':
megaparsec/Text/Megaparsec/Stream.hs
Lines 419 to 436 in cd6302d
Although the St
constructor remains evaluated each time around the foldl'
iteration, the SourcePos
inside does not. Instead it accumulates a large thunk.
Now, this doesn't affect megaparsec
itself, because it only uses reachOffset'
to derive reachOffsett
methods and the reachOffset
methods are never used within the library itself. However, a problem can arise if a client defines a TraversableStream
whose reachOffsetNoLine
method is derived from its reachOffset
. If that reachOffset
uses reachOffset'
directly, or indirectly by using one of the built in TraversableStream
instances, then the space leak will manifest itself.
Given that reachOffset
seems no longer used by megaparsec
perhaps the best long term fix is to deprecate then remove that method. In the short term perhaps you could make that field strict again.
The SourcePos
field was made lazy in e307266 but I don't understand why. I can hardly imagine that allowing this space leak is good for performance.
IIRC this was a conscious decision. reachOffset
is supposed to be used only when parsing is finished as part of error reporting. Its performance therefore doesn't matter as much because it will be evaluated only in "parse failure" branches of execution flow. I remember that I profiled some parsers and making that field lazy gave some noticeable performance improvement, so I made it lazy.
I'm open to either updating the documentation to mention this possible performance problem or making the field strict provided the benchmarks are not getting any worse.
It may be that my memory deceives me though. Happy to make this strict provided there is no performance regression.
the
reachOffset
methods are never used within the library itself
I just realised this isn't true. reachOffset
is used in errorBundlePretty
, so it's even more imperative to fix this issue.
Is there any circumstance in which one would call reachOffset
but not evaluate the PosState
it returns? If not then it makes no sense to make that field of St
lazy. If it's lazy you still do exactly as much computation updating the PosState
, plus additional computation creating and entering thunks, plus consume additional amounts of memory proportional to o - pstateOffset
. The constant factor on the memory usage is large, tens of bytes at least, so if you are parsing a 10 MB file you can easily consume 1 GB just to report errors.
master
Uses 2.4 MB to report errors in a 10 kB input stream, an overhead of 24x! (Naturally, the size of the space leak scales linearly with the size of the input file.)
Strict SourcePos
in St
The space leak is gone.
Code
Would you like to open a PR?
OK: #487