v7 proposals

Here are some things I would like to see in v7 of this package.

The target design space for this package should be similar to MegaParsec: intended for users who prefer correctness and feature-completeness to speed. Anyone who wants speed in a V8 runtime environment will use the built-in Regex.

Text.Parsing.Parser

purescript-parsing/src/Text/Parsing/Parser.purs

Lines 52 to 53 in d085e37

    
           -- | Contains the remaining input and current position. 
        
           data ParseState s = ParseState s Position Boolean

Change the definition of ParseState so that we can have cursor-based state in parsers, and so that line-column state is optional.

Tracking the newline-based line and column position is an important feature but it’s expensive and rarely-used. I would like to try to make that optional.

I'd like to switch to a cursor-based state for String parsers, instead of a state which tracks “the remaining input”.

Do we need the Boolean “consumed flag” in the ParseState? As far as I can tell this is set but never tested. Nothing cares what the “consumed flag” value is?

Make the Position zero-based. #94

data ParseState s state = ParseState s state

Text.Parsing.Parser.Combinators

Add combinators manyTill, many1Till_ #108

Text.Parsing.Parser.String

UTF-16 correctness. We should always handle UTF-16 surrogate paris correctly, and that means always treating token as CodePoint instead of CodeUnit. #109
Delete the StringLike typeclass. Has anyone ever created an instance of this class for a type other than String?
Add combinator match #107

Text.Parsing.Parser.DataView

Add DataView parsing to this package? rowtype-yoga/purescript-parsing-dataview#10

Module names

Remove the Text. prefix from all module names.

@paf31 introduces StringLike #36 to “support more efficient string representations.”

What kind of representations? The only thing I can think of is something like CatList<String>, which could be a more efficient “string” representation if the the “string” is a large document which is being edited, or is being lazily read in chunks out of a large file?

We should combine parsing and string-parsers #69

https://github.com/Thimoteus/SandScript/wiki/2.-Parsing-recursively

Here's a parsing monad.

https://github.com/natefaubion/purescript-language-cst-parser/blob/main/src/PureScript/CST/Parser/Monad.purs

Here's a CPS purescript parsing monad

https://github.com/cdepillabout/cabal2nixWithoutIFD/blob/main/purescript-parser-combinator/src/Parsec.purs

Adapted from

https://github.com/jonascarpay/alloy/blob/master/src/Parser/Parsec.hs

Adapted from Parsec.

I would want this library to be as full-featured as possible, and to have these properties (which we mostly already have):

Stack-safe
Auto-backtracking (if a parser fails then it consumes no input)
Monad-transformable
Input streams extendable with build-in support for
- String UCS-2 Big-Endian
- String UCS-2 Little-Endian
- String UTF-16 Big-Endian
- String UTF-16 Little-Endian
- UInt8Array UTF-8
- DataView
- forall token. List<token>

Node.js only supports UTF-16 Little-Endian https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

https://mathiasbynens.be/notes/javascript-encoding

https://kevin.burke.dev/kevin/node-js-string-encoding/

The purescript-string-parsers Text.Parsing.StringParser.CodePoints module has the design decision to

Use a cursor in units of code points.
Return a Char.

A better design would be to

Use a cursor in units of code units (and increment by two for astral characters)
Return a CodePoint.

purescript-contrib/purescript-string-parsers#48

We could use this getWholeChar function.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt#getting_whole_characters

Or actually the CodePoints.uncons might suffice

https://github.com/purescript/purescript-strings/blob/157e372a23e4becd594d7e7bff6f372a6f63dd82/src/Data/String/CodePoints.purs#L191-L191

Actually, maybe there is no performance improvement to be had with a cursor-based parser state?

purescript/purescript-strings#120

Instructions for how to parse a String with Regex, then switch to Parser, then switch back to Regex. This should be a supported use case, considering that Parser is 100× slower than Regex.

Also support the opposite case, with a parseRegex :: Regex -> ParserT String m (Array String).

More package properties

Does not release (does not free the memory of) input already consumed.
Does not allow for continuation of more input received (like Attoparsec).

purescript-parsing/src/Text/Parsing/Parser.purs

Lines 104 to 110 in 297ad9e

    
           instance altParserT :: Monad m => Alt (ParserT s m) where 
        
             alt p1 p2 = (ParserT <<< ExceptT <<< StateT) \(s@(ParseState i p _)) -> do 
        
               Tuple e s'@(ParseState _ _ c') <- runStateT (runExceptT (unwrap p1)) (ParseState i p false) 
        
               case e of 
        
                 Left _ 
        
                   | not c' -> runStateT (runExceptT (unwrap p2)) s 
        
                 _ -> pure (Tuple e s')

I think the purpose of the consumed flag is to do something like this?

Note that if p succeeds without consuming input the second alternative is favored if it consumes input. This implements the “longest match” rule.

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/parsec-paper-letter.pdf p.11

Except the way that I read this Alt instance is that it favors p2 if p1 failed and consumed no input.

	-- \| Contains the remaining input and current position.
	data ParseState s = ParseState s Position Boolean

	instance altParserT :: Monad m => Alt (ParserT s m) where
	alt p1 p2 = (ParserT <<< ExceptT <<< StateT) \(s@(ParseState i p _)) -> do
	Tuple e s'@(ParseState _ _ c') <- runStateT (runExceptT (unwrap p1)) (ParseState i p false)
	case e of
	Left _
	\| not c' -> runStateT (runExceptT (unwrap p2)) s
	_ -> pure (Tuple e s')