purescript-contrib/purescript-parsing

v7 proposals

jamesdbrock opened this issue · 10 comments

Here are some things I would like to see in v7 of this package.

The target design space for this package should be similar to MegaParsec: intended for users who prefer correctness and feature-completeness to speed. Anyone who wants speed in a V8 runtime environment will use the built-in Regex.

Text.Parsing.Parser

-- | Contains the remaining input and current position.
data ParseState s = ParseState s Position Boolean

Change the definition of ParseState so that we can have cursor-based state in parsers, and so that line-column state is optional.

Tracking the newline-based line and column position is an important feature but it’s expensive and rarely-used. I would like to try to make that optional.

  1. I'd like to switch to a cursor-based state for String parsers, instead of a state which tracks “the remaining input”.

Do we need the Boolean “consumed flag” in the ParseState? As far as I can tell this is set but never tested. Nothing cares what the “consumed flag” value is?

  1. Make the Position zero-based. #94
data ParseState s state = ParseState s state

Text.Parsing.Parser.Combinators

  1. Add combinators manyTill, many1Till_ #108

Text.Parsing.Parser.String

  1. UTF-16 correctness. We should always handle UTF-16 surrogate paris correctly, and that means always treating token as CodePoint instead of CodeUnit. #109
  2. Delete the StringLike typeclass. Has anyone ever created an instance of this class for a type other than String?
  3. Add combinator match #107

Text.Parsing.Parser.DataView

  1. Add DataView parsing to this package? rowtype-yoga/purescript-parsing-dataview#10

Module names

  1. Remove the Text. prefix from all module names.

@paf31 introduces StringLike #36 to “support more efficient string representations.”

What kind of representations? The only thing I can think of is something like CatList<String>, which could be a more efficient “string” representation if the the “string” is a large document which is being edited, or is being lazily read in chunks out of a large file?

We should combine parsing and string-parsers #69

I would want this library to be as full-featured as possible, and to have these properties (which we mostly already have):

  • Stack-safe
  • Auto-backtracking (if a parser fails then it consumes no input)
  • Monad-transformable
  • Input streams extendable with build-in support for
    • String UCS-2 Big-Endian
    • String UCS-2 Little-Endian
    • String UTF-16 Big-Endian
    • String UTF-16 Little-Endian
    • UInt8Array UTF-8
    • DataView
    • forall token. List<token>

Node.js only supports UTF-16 Little-Endian https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

https://mathiasbynens.be/notes/javascript-encoding

https://kevin.burke.dev/kevin/node-js-string-encoding/

The purescript-string-parsers Text.Parsing.StringParser.CodePoints module has the design decision to

  • Use a cursor in units of code points.
  • Return a Char.

A better design would be to

  • Use a cursor in units of code units (and increment by two for astral characters)
  • Return a CodePoint.

purescript-contrib/purescript-string-parsers#48

We could use this getWholeChar function.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt#getting_whole_characters

Or actually the CodePoints.uncons might suffice

https://github.com/purescript/purescript-strings/blob/157e372a23e4becd594d7e7bff6f372a6f63dd82/src/Data/String/CodePoints.purs#L191-L191

Actually, maybe there is no performance improvement to be had with a cursor-based parser state?

purescript/purescript-strings#120

Instructions for how to parse a String with Regex, then switch to Parser, then switch back to Regex. This should be a supported use case, considering that Parser is 100× slower than Regex.

Also support the opposite case, with a parseRegex :: Regex -> ParserT String m (Array String).

More package properties

  • Does not release (does not free the memory of) input already consumed.
  • Does not allow for continuation of more input received (like Attoparsec).

instance altParserT :: Monad m => Alt (ParserT s m) where
alt p1 p2 = (ParserT <<< ExceptT <<< StateT) \(s@(ParseState i p _)) -> do
Tuple e s'@(ParseState _ _ c') <- runStateT (runExceptT (unwrap p1)) (ParseState i p false)
case e of
Left _
| not c' -> runStateT (runExceptT (unwrap p2)) s
_ -> pure (Tuple e s')

I think the purpose of the consumed flag is to do something like this?

Note that if p succeeds without consuming input the second alternative is favored if it consumes input. This implements the “longest match” rule.

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/parsec-paper-letter.pdf p.11

Except the way that I read this Alt instance is that it favors p2 if p1 failed and consumed no input.