purescript-contrib/purescript-parsing

The `regex` API makes it very difficult to catch syntax errors in the regex.

Closed this issue · 6 comments

The regex API defers maps a syntax error to a call to fail. Unfortunately this means if you use regex with alt, this failure will probably just get swallowed, and the parser will just fallthrough to the next alternative. This makes it very difficult to track down these errors. I experienced this when writing the json parser benchmarks, which uses regex.

Some alternatives may be:

  • Change the signature to yield a Either String (ParserT String m String), and have unsafeRegex functions which use unsafeCrashWith. This will mean that syntax errors will fail with a hard exception at least, and you'll have a stack trace to the regex call.
  • Change the failure implementation to consume $ fail ... which will set the consume bit so it's less likely to fallthrough. Unfortunately this doesn't work if it's wrapped in try, and it gives you no indication where the error comes from.
  • Provide some sort of hard failure mode in the parser implementation so that it's impossible to recover from it. This still doesn't give you a very good indication where it came from though.

That's a good point. I'm leaning towards doing your first alternative.

I'm also wondering if there is a sensible way to require a RegExp compilation step before creating a regex parser.

Maybe something like this?

-- | Unexported constructor
newtype ParserRegex = ParserRegex Regex

-- | Compile the regular expression pattern string to get an instance of `ParserRegex` needed to make a `regex` parser.
-- |
-- | Call this function as far in advance as possible. Ideally call it outside of the `ParserT` monad.
-- | If you call this function inside of the `ParserT` monad and then `fail` the parse when the compilation fails,
-- | then that could be confusing because a parser failure is supposed to indicate an invalid input string.
-- | If the compilation failure occurs in an `alt` then the compilation failure might not be reported at all and instead
-- | the input string would be parsed incorrectly.
compileParserRegex :: String -> RegexFlags -> Either String ParserRegex
compileParserRegex pattern flags = ParserRegex <$> String.Regex.regex ("^(" <> pattern <> ")") flags

-- | Parser which uses the `Data.String.Regex` module to parse a `String`.
regex :: forall m. ParserRegex -> ParserT String m String

I think that would depend on if you are imagining an additional API on top of ParserRegex, or additional restrictions. For example, only being able to construct it with specific flags. As you said before, I think dotAll, unicode, and ignoreCase are the only relevant flags for a parser regex.

I think that would depend on if you are imagining an additional API on top of ParserRegex, or additional restrictions.

Yeah, there is one important restriction on ParserRegex, which is that the pattern must begin with "^". That’s the only reason for making it a newtype. Otherwise we could allow constructing a parser with a Regex.

If all you want to do is provide a constructor with no additional APIs or API restrictions, then the newtype is equivalent to just building the Parser directly (ie, it doesn't give you any additional expressiveness since all you can do is pass it to the function). I personally think restricting the flags would be useful if you went the newtype route.

If all you want to do is provide a constructor with no additional APIs or API restrictions, then the newtype is equivalent to just building the Parser directly

Oh right, it’s equivalent to your first alternative, so it would be

regex :: String -> RegexFlags -> Either String (ParserT String m String)
-- | Compile a regular expression string into a regular expression parser.
-- |
-- | This function will use the `Data.String.Regex.regex` function to compile and return a parser which can be used 
-- | in a `ParserT String` monad.
-- |
-- | This function should be called before the `ParserT String` monad is entered, because this function might 
-- | fail with a `Left` RegExp compilation error message.
-- | If you call this function inside of the `ParserT` monad and then `fail` the parse when the compilation fails,
-- | then that could be confusing because a parser failure is supposed to indicate an invalid input string.
-- | If the compilation failure occurs in an `alt` then the compilation failure might not be reported at all and instead
-- | the input string would be parsed incorrectly.
-- |
-- | #### Example
-- |
-- | This example shows how to compile and run the `xMany` parser which will capture the regular expression pattern `x*`.
-- |
-- | ```purescript
-- | case mkRegex "x*" noFlags of
-- |   Left compileError -> unsafePerformEffect $ throw $ "xMany failed to compile: " <> compileError
-- |   Right xMany -> case runParser "xxxZ" xMany of
-- |     Left (ParseError parseError _) -> -- parse failed
-- |     Right capture -> -- capture should be "xxx"
-- | ```
-- |
-- | #### Flags
-- |
-- | Set `RegexFlags` with the `Semigroup` instance like this.
-- |
-- | ```purescript
-- | mkRegex "x*" (dotAll <> ignoreCase)
-- |
-- | The `dotAll`, `unicode`, and `ignoreCase` flags might make sense for a `mkRegex` parser. The other flags will 
-- | probably cause surprising behavior and you should stay away from them.
mkRegex :: String -> RegexFlags -> Either String (ParserT String m String)