mysticatea/regexpp

Visitor: improve usability

conartist6 opened this issue · 6 comments

The current visitor has some problems. The biggest is that there's no way to visit every expression, where an expression might be the content of a group or of a lookahead or lookbehind. All expressions have common needs -- in particular the alternatives need to be evaluated to know if the expression matches, after which different actions are appropriate depending on the specific type. The worst offender is the assertion type, because it may or may not be an expression at all, depending on whether the kind is lookahead or lookbehind. This may be fine for a backtracking engine since these can evaluate the assertion inside their own block scope, but it makes life fabulously difficult for non-backtracking engines which must always track the states until the input reveals whether the subexpression matches or fails.

OK I was able to get much better results by writing my own visitor. I tried a couple variations, but what ended up being best was going all the way back to the simplest basics where a visitor for a type is responsible for propagating the visit forwards itself. This had huge advantages:

  • I was able to easily abstract out some code to deal with alternatives.
  • I could prevent traversal into structures I consider atomic, i.e. those that try to match against an input character (like character classes). Characters are sometimes atomic (so I must have a visitor for them) but inside character classes they are not. Avoiding traversing into the character class allowed me to avoid a complex conditional to determine whether a character was atomic.
  • I was able to make visit return a value, which is super useful. Now instead of maintaining structure outside the traversal I can use pure functional composition. This greatly simplified the implementation of repetition, and made almost all the code more readable.
  • Very fast. There are no conditionals at all left and no more wrapper function calls. Many fewer lines of code.

I do pay a bit of a cost of course as I have to maintain the code that propagates the traversal forwards, but having the return values ensures it's pretty obvious if something is missing or incorrect.

Here is my traversal code.

Hi @conartist6!

Since this repo is unmaintained, you might want to re-open this issue in the @eslint-community fork https://github.com/eslint-community/regexpp

For more info about why we created this organization, you can read https://eslint.org/blog/2023/03/announcing-eslint-community-org

@MichaelDeBoey Cool, I should probably upgrade my projects to use the one that you're maintaining then!

@conartist6 It's indeed advised to update your projects to use the community fork instead of using the unmaintained original project

@MichaelDeBoey Ah thanks! I actually went the other way and wrote my own regex parser, uh, and also my own AST data structure and my own parser framework

@conartist6 That's another possibility of course 🙈