v7 proposals
jamesdbrock opened this issue · 10 comments
Here are some things I would like to see in v7 of this package.
The target design space for this package should be similar to Megaparsec: intended for users who prefer correctness and feature-completeness to speed. Anyone who wants speed in a V8 runtime environment will use the built-in Regex.
Text.Parsing.Parser
purescript-parsing/src/Text/Parsing/Parser.purs
Lines 52 to 53 in d085e37
Change the definition of ParseState so that we can have cursor-based state in parsers, and so that line-column state is optional.
Tracking the newline-based line and column position is an important feature but it’s expensive and rarely-used. I would like to try to make that optional.
- I'd like to switch to a cursor-based state for String parsers, instead of a state which tracks “the remaining input”.
Do we need the Boolean “consumed flag” in the ParseState? As far as I can tell this is set but never tested. Nothing cares what the “consumed flag” value is?
- Make the Position zero-based. #94
data ParseState s state = ParseState s state
Text.Parsing.Parser.Combinators
- Add combinators manyTill, many1Till_ #108
Text.Parsing.Parser.String
- UTF-16 correctness. We should always handle UTF-16 surrogate pairs correctly, and that means always treating a token as a CodePoint instead of a CodeUnit. #109
- Delete the StringLike typeclass. Has anyone ever created an instance of this class for a type other than String?
- Add combinator match #107
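To illustrate why code-unit tokens are a correctness problem, here is a minimal TypeScript sketch (TypeScript is used only because the runtime is JavaScript; none of this is code from the package, and the names `codeUnitTokens` and `isSurrogate` are hypothetical). It shows that tokenizing a JavaScript string by code unit splits an astral character into two surrogate halves, neither of which is a valid character on its own.

```typescript
// "a", then U+1D11E (musical G clef, stored as the surrogate pair D834 DD1E), then "b".
const sample: string = "a\uD834\uDD1Eb";

// Tokenize by UTF-16 code unit, the way a CodeUnit-based parser sees the input.
function codeUnitTokens(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    out.push(s.charCodeAt(i));
  }
  return out;
}

// True for either half of a surrogate pair (U+D800 through U+DFFF).
function isSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdfff;
}

// Three characters as a reader sees them, but four code-unit tokens,
// two of which are surrogate halves.
const tokens: number[] = codeUnitTokens(sample);
const surrogateTokens: number[] = tokens.filter(isSurrogate);
```

A parser whose token type is a code point would hand the caller the single value U+1D11E instead of these two halves.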
Text.Parsing.Parser.DataView
- Add DataView parsing to this package? rowtype-yoga/purescript-parsing-dataview#10
Module names
- Remove the Text. prefix from all module names.
@paf31 introduced StringLike in #36 to “support more efficient string representations.”
What kind of representations? The only thing I can think of is something like CatList<String>, which could be a more efficient “string” representation if the “string” is a large document which is being edited, or is being lazily read in chunks out of a large file.
We should combine parsing and string-parsers #69
Here's a parsing monad.
Here's a CPS purescript parsing monad
Adapted from
https://github.com/jonascarpay/alloy/blob/master/src/Parser/Parsec.hs
Adapted from Parsec.
I would want this library to be as full-featured as possible, and to have these properties (which we mostly already have):
- Stack-safe
- Auto-backtracking (if a parser fails then it consumes no input)
- Monad-transformable
- Input streams extendable, with built-in support for
  - String UCS-2 Big-Endian
  - String UCS-2 Little-Endian
  - String UTF-16 Big-Endian
  - String UTF-16 Little-Endian
  - UInt8Array UTF-8
  - DataView
  - forall token. List<token>
Node.js only supports UTF-16 Little-Endian https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings
The purescript-string-parsers Text.Parsing.StringParser.CodePoints module has the design decision to
- Use a cursor in units of code points.
- Return a Char.
A better design would be to
- Use a cursor in units of code units (and increment by two for astral characters)
- Return a CodePoint.
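A rough sketch of that better design, in TypeScript rather than PureScript so that it runs directly on the JavaScript string representation. The `Step` type and the `unconsAt` function are hypothetical names for illustration, not part of either library. The cursor counts code units and each step returns a whole code point.

```typescript
// Result of reading one token: the decoded code point and the new cursor.
interface Step {
  codePoint: number;
  next: number; // cursor position in code units
}

// Read one code point at a code-unit cursor. Advances by two code units
// for an astral character (a valid surrogate pair), otherwise by one.
function unconsAt(input: string, cursor: number): Step | null {
  if (cursor < 0 || cursor >= input.length) return null; // end of input
  const hi = input.charCodeAt(cursor);
  if (hi >= 0xd800 && hi <= 0xdbff && cursor + 1 < input.length) {
    const lo = input.charCodeAt(cursor + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      const codePoint = (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
      return { codePoint, next: cursor + 2 };
    }
  }
  // BMP character, or an unpaired surrogate passed through unchanged.
  return { codePoint: hi, next: cursor + 1 };
}
```

Because the cursor is in code units, indexing into the input stays constant-time, while the value handed to the caller is never half of a surrogate pair.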
purescript-contrib/purescript-string-parsers#48
We could use this getWholeChar function.
Or actually the CodePoints.uncons might suffice
Actually, maybe there is no performance improvement to be had with a cursor-based parser state?
Instructions for how to parse a String with Regex, then switch to Parser, then switch back to Regex. This should be a supported use case, considering that Parser is 100× slower than Regex.
Also support the opposite case, with a parseRegex :: Regex -> ParserT String m (Array String).
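One way the regex-inside-parser direction could work at the JavaScript level is with a sticky regex anchored at the parser's cursor. This is a TypeScript sketch under that assumption, not the proposed PureScript implementation; `RegexStep` and `matchRegexAt` are hypothetical names. It returns the match and its capture groups as an array of strings, along with the advanced cursor.

```typescript
// Result of running a regex at the cursor: matched groups and the new cursor.
interface RegexStep {
  groups: string[]; // groups[0] is the whole match
  next: number;     // cursor position in code units after the match
}

// Run a regex pattern exactly at `cursor`. The sticky flag "y" makes the
// match start at lastIndex instead of searching forward through the input.
function matchRegexAt(source: string, input: string, cursor: number): RegexStep | null {
  const re = new RegExp(source, "y");
  re.lastIndex = cursor;
  const m = re.exec(input);
  if (m === null) return null; // no match at the cursor: consume nothing
  const groups: string[] = [];
  for (let i = 0; i < m.length; i++) {
    const g = m[i];
    groups.push(g === undefined ? "" : g); // unmatched optional groups become ""
  }
  return { groups, next: re.lastIndex };
}
```

On failure the cursor is left where it was, which fits the auto-backtracking property listed earlier in this thread.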
More package properties
- Does not release (does not free the memory of) input already consumed.
- Does not allow for continuation of more input received (like Attoparsec).
purescript-parsing/src/Text/Parsing/Parser.purs
Lines 104 to 110 in 297ad9e
I think the purpose of the consumed flag is to do something like this?
Note that if p succeeds without consuming input the second alternative is favored if it consumes input. This implements the “longest match” rule.
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/parsec-paper-letter.pdf p.11
Except the way that I read this Alt instance is that it favors p2 if p1 failed and consumed no input.
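To make that reading concrete, here is a small TypeScript model of an alternative combinator that consults a consumed flag. It is only a sketch of the behavior as I read it, not a transcription of the Alt instance in Parser.purs; `Reply`, `Parser`, `literal`, and `alt` are all hypothetical names.

```typescript
// Outcome of running a parser: success or failure, and whether input was consumed.
interface Reply<A> {
  ok: boolean;
  consumed: boolean;
  value: A | null;
  next: number;
}

type Parser<A> = (input: string, cursor: number) => Reply<A>;

// Match a literal one character at a time, so that a mismatch after the
// first character is a failure that has already consumed input.
function literal(expected: string): Parser<string> {
  return (input, cursor) => {
    let i = 0;
    while (i < expected.length && input.charAt(cursor + i) === expected.charAt(i)) {
      i++;
    }
    const ok = i === expected.length;
    return { ok, consumed: i > 0, value: ok ? expected : null, next: cursor + i };
  };
}

// Try p2 only if p1 failed AND consumed no input.
function alt<A>(p1: Parser<A>, p2: Parser<A>): Parser<A> {
  return (input, cursor) => {
    const r1 = p1(input, cursor);
    if (r1.ok || r1.consumed) return r1;
    return p2(input, cursor);
  };
}
```

Under this model, `alt(literal("let"), literal("lambda"))` fails on the input "lambda", because the first branch consumed the "l" before failing, while `alt(literal("let"), literal("fn"))` succeeds on "fn". That is a different rule from the “longest match” behavior quoted from the Parsec paper.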