[proposed labels: question, feature request] best practices for stateful matching of simple patterns

Question

genovese opened this issue 2 years ago · 4 comments

Megaparsec is terrific: powerful, flexible, a joy to use. I've been making heavy use of it in several projects. Thanks!

There are two needs that keep coming up, however, and I'm wondering whether I'm missing some best practices that would obviate them. First, I keep wanting something like takeWhile1P but with conditions that depend on the tokens matched so far. The fast, stateful scanner along these lines requested in issue #314 would fit the bill. Second, I would like a (backtracking) primitive that matches a specified regular expression, even one in simple POSIX style without any PCRE fanciness.

I recognize that one can use combinators to mimic the typical regex operators, but when matching higher-level syntactic constructs whose components vary in form, this tends to introduce more complexity than I would like. For instance, if I'm matching symbols that can start with one set of characters and continue with characters from a larger set, I end up with something like this (removing context and other structure):

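import Control.Monad (liftM2)
import Data.Char (isAlphaNum)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Megaparsec

-- Parser, symbolLeadingChars, and symbolLaterChars come from the omitted context.
isSymbolLeadingChar :: Char -> Bool
isSymbolLeadingChar c = isAlphaNum c || T.elem c symbolLeadingChars

isSymbolLaterChar :: Char -> Bool
isSymbolLaterChar c = isAlphaNum c || T.elem c symbolLaterChars

-- A non-empty run of leading characters, followed by any run of later characters.
mySymbol :: Parser Text
mySymbol = liftM2 (<>) symbol1 symbol2
  where symbol1 = takeWhile1P (Just "Symbol") isSymbolLeadingChar
        symbol2 = takeWhileP (Just "Symbol continued") isSymbolLaterChar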

This works fine, but it seems like a lot of boilerplate for a simple idea, and even with just a few categories like this, things end up diffuse and messy. (A few provisos. In some cases, I can grab a more general construct, then classify and wrap it accordingly or fail. But when matching particular constructs in context, such as a list of the particular kind of symbol above, it's easier to have a specific parser. I also realize that I can use something like Alex with a custom token type to handle lexing, but there are times when I'd rather keep it all in the family, so to speak.) A stateful scanner primitive would help a bit here, but a simple regex matcher would be even more convenient in this case. (I'd love to see both of those additions.)
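For illustration, a stateful scan can be approximated with primitives Megaparsec already has: inspect the remaining input with getInput, compute how long a prefix the step function accepts, then consume exactly that many tokens with takeP. This is only a sketch under assumptions: scanLength and scan1 are hypothetical helpers (not Megaparsec API and not the scanner proposed in #314), Parser is assumed to be Parsec Void Text, and the two character sets are placeholders because the post does not show them. It also walks the prefix twice, so it is not the fast primitive asked for.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAlphaNum)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import Text.Megaparsec

-- Assumed parser type; the original post does not show its definition.
type Parser = Parsec Void Text

-- Placeholder character sets (assumptions, not from the original post).
symbolLeadingChars, symbolLaterChars :: Text
symbolLeadingChars = "_$"
symbolLaterChars = symbolLeadingChars <> "-?!"

isSymbolLeadingChar, isSymbolLaterChar :: Char -> Bool
isSymbolLeadingChar c = isAlphaNum c || T.elem c symbolLeadingChars
isSymbolLaterChar c = isAlphaNum c || T.elem c symbolLaterChars

-- | Length of the longest prefix accepted by a stateful step function.
-- The step returns the next state, or Nothing to stop before this character.
scanLength :: s -> (s -> Char -> Maybe s) -> Text -> Int
scanLength s0 step = go s0 0
  where
    go s n t = case T.uncons t of
      Just (c, rest) | Just s' <- step s c -> go s' (n + 1) rest
      _ -> n

-- | Hypothetical helper: consume at least one character while the step
-- function keeps accepting. Nothing is consumed on failure.
scan1 :: String -> s -> (s -> Char -> Maybe s) -> Parser Text
scan1 name s0 step = do
  input <- getInput
  case scanLength s0 step input of
    -- Nothing matched: an always-false takeWhile1P fails at the current
    -- position and reports the given label as the expected item.
    0 -> takeWhile1P (Just name) (const False)
    n -> takeP (Just name) n

-- | The symbol parser as a single scan. The state is True only while we are
-- still looking at the first character.
mySymbol :: Parser Text
mySymbol = scan1 "Symbol" True step
  where
    step isFirst c
      | isFirst, isSymbolLeadingChar c = Just False
      | not isFirst, isSymbolLaterChar c = Just False
      | otherwise = Nothing
```

With the placeholder sets above, parseMaybe mySymbol "_foo-bar" succeeds with the whole input, while "-foo" and the empty string are rejected.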

My question is whether there is a better approach within the intended Megaparsec idioms to capture simple patterns like this.

I hope this is all clear. Thanks for your help.

Answer 1 · 2023-01-16T09:59:43.000Z

I agree that scanP would be helpful here, so I'd count this issue as a supporting case for #314. As far as I'm aware, you are not missing anything, except that in your example I think you intended to write:

mySymbol :: Parser Text
mySymbol = liftM2 (<>) symbol1 symbol2
  where symbol1 = T.singleton <$> (satisfy isSymbolLeadingChar <?> "Symbol")
        symbol2 = takeWhileP (Just "Symbol continued") isSymbolLaterChar

I imagine the predicate isSymbolLeadingChar is meant to apply only to the first character, not to the first N characters.

Answer 2 · 2023-01-16T15:08:39.000Z

Thanks.

On the example: since the second set is a superset of the first, I take as many characters as possible from the first set while I'm at it, which is why I wrote it that way.
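To make the comparison concrete, here is a small sketch with both variants side by side. The character sets are placeholders (the thread never shows them) chosen so that the later set contains every leading character; the names symbolByRuns and symbolByFirstChar are made up for illustration, and Parser is assumed to be Parsec Void Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAlphaNum)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import Text.Megaparsec

-- Assumed parser type; not shown in the thread.
type Parser = Parsec Void Text

-- Placeholder character sets (assumptions). The later set deliberately
-- includes every leading character.
symbolLeadingChars, symbolLaterChars :: Text
symbolLeadingChars = "_$"
symbolLaterChars = symbolLeadingChars <> "-?!"

isSymbolLeadingChar, isSymbolLaterChar :: Char -> Bool
isSymbolLeadingChar c = isAlphaNum c || T.elem c symbolLeadingChars
isSymbolLaterChar c = isAlphaNum c || T.elem c symbolLaterChars

-- Variant from the question: a non-empty run of leading characters,
-- then a (possibly empty) run of later characters.
symbolByRuns :: Parser Text
symbolByRuns =
  (<>) <$> takeWhile1P (Just "Symbol") isSymbolLeadingChar
       <*> takeWhileP (Just "Symbol continued") isSymbolLaterChar

-- Variant from Answer 1: exactly one leading character,
-- then a (possibly empty) run of later characters.
symbolByFirstChar :: Parser Text
symbolByFirstChar =
  T.cons <$> (satisfy isSymbolLeadingChar <?> "Symbol")
         <*> takeWhileP (Just "Symbol continued") isSymbolLaterChar
```

With these placeholder sets, both parsers accept "_a-b?" in full and both reject "-ab". The agreement relies on the superset relationship: if some leading character were missing from the later set, the two variants could consume different amounts of input.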

Thoughts on the regex matcher?

Answer 3 · 2023-01-16T15:30:00.000Z

Sorry, I am not aware of anything that brings regex support to Megaparsec. Perhaps you could look into lexing with Alex or a similar tool.
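For very simple patterns, a small matcher can be layered on top of existing primitives without any library support. The sketch below is an assumption-laden illustration, not an existing Megaparsec feature: the Regex type, advance, and regexP are all hypothetical names, Parser is assumed to be Parsec Void Text, and the matcher explores every alternative, so it can be slow on ambiguous patterns. It computes all prefix lengths the pattern can match on the remaining input and then consumes the longest one with takeP.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAlpha, isAlphaNum)
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void (Void)
import Text.Megaparsec

-- Assumed parser type; not shown in the thread.
type Parser = Parsec Void Text

-- | A tiny regex syntax tree (illustrative only, not a Megaparsec type).
data Regex
  = Eps                   -- matches the empty string
  | Class (Char -> Bool)  -- one character satisfying a predicate
  | Seq Regex Regex       -- r1 followed by r2
  | Alt Regex Regex       -- r1 or r2
  | Star Regex            -- zero or more repetitions of r

-- | Number of characters consumed so far, paired with the remaining input.
type Cursor = (Int, Text)

-- | Every cursor reachable by matching the regex from the given cursor.
advance :: Regex -> Cursor -> [Cursor]
advance Eps p = [p]
advance (Class ok) (n, t) = case T.uncons t of
  Just (c, rest) | ok c -> [(n + 1, rest)]
  _ -> []
advance (Seq a b) p = concatMap (advance b) (advance a p)
advance (Alt a b) p = advance a p ++ advance b p
advance (Star a) p@(n, _) =
  -- Zero repetitions, or one repetition that makes progress followed by more.
  -- Requiring progress keeps Star from looping on patterns that match "".
  p : concatMap (advance (Star a)) [q | q@(m, _) <- advance a p, m > n]

-- | Hypothetical primitive: consume the longest prefix matching the regex.
-- Nothing is consumed when there is no match.
regexP :: String -> Regex -> Parser Text
regexP name re = do
  input <- getInput
  case map fst (advance re (0, input)) of
    [] -> fail ("no match for " ++ name)
    ns -> takeP (Just name) (maximum ns)

-- | [[:alpha:]][[:alnum:]]*
identifier :: Regex
identifier = Seq (Class isAlpha) (Star (Class isAlphaNum))

-- | a*a, which only matches if the star gives back its last character.
aStarA :: Regex
aStarA = Seq (Star (Class (== 'a'))) (Class (== 'a'))
```

For example, parseMaybe (regexP "identifier" identifier) "abc123" succeeds with the whole input and "1abc" is rejected. The aStarA pattern matches all of "aaa" because every split between the star and the final character is considered, which a greedy many followed by the same character would not do.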

Answer 4 · 2023-01-17T01:35:24.000Z

Understood. That part was a feature request. Thanks though, all good.