Ignore
conartist6 opened this issue · 12 comments
Something is wrong with ignore. Right now I want to push an ignore context onto the ignore stack that ignores nothing, but that makes parsing never terminate. Looking at the code is making my head spin. There seems to be a half-baked workaround for the fact that the logic is circular. Sometimes null is pushed onto the ignore stack, which is unfortunate as the code that consumes the ignore stack assumes that it contains only functions. But fortunately it always pushes and pops the ignore stack in twos. If I just rip out that craziness, then I'm left with straight up infinite recursion, which actually makes a lot of sense.
Sorry I missed this issue somehow. Ignore(null, rule) will indeed push null to the ignore stack which tells the parser to ignore nothing. This mechanism is used for instance in JSexpr grammar for TemplateLiteral.
I believe this is not per se what makes your parsing never terminate, but more likely an issue with your grammar that perhaps allows situations where the parser is unable to make any progress, e.g. Plus(Optional(whatever)) would never terminate for this reason.
Shouldn't this line return?
The first argument to Ignore can be any rule, e.g.
Any(
/^\s+/,
/^\/\/[^\n]*/, // line comment
/^\/\*(?:\*(?!\/)|[^*])*\*\// // block comment
);which should ignore all comments and whitespace in any combination. If this line just returned, I would need to either write the rule above as a very long RegEx, or directly as Plus(Any(...)). However, Plus itself is using the ignore stack, so the latter would result in infinite recursion: that's why I need to push null on the ignore stack first: Ignore(null, Plus(Any(...))).
Turns out, this is a very common pattern for Ignore, so I just decided to put the Ignore(null, Plus(actualIgnoreRule)) wrapper inside Ignore directly.
Aaaaaaaah, so that's the infinite recursion I was getting. It makes sense now. That's exactly the kind of information that would make a great code comment. Also documentation would have made clearer what you have just shared about the possibility of ignore taking a whole rule.
My fork definitely doesn't support using a rule as an ignore pattern, so I need to go back and figure out what the root problem was that I was attempting to address in the first place. It may just be that I didn't know about Ignore(null, ...) (also undocumented). And without unit tests I figured there could easily be bugs, thought it's hard to say what's a bug without knowing what the API is supposed to be... But anyway it sounds like in this case I should put the original logic back (with comments) as it was working as intended.
Other than that, are you interested in merging my fork back into rd-parse? From my README here are the changes I made:
- Undocumented exports are removed, particularly
START, andUse. - Regex token changes:
- Capture groups no longer have meaning. Instead the full match result is used.
- required
^is now prepended to expression for you
Ignorerenamed toIgnoreBetween.Ignoreis a new rule type.
OK nevermind I think I'm going to keep going in my direction for a while on the fork. I want to see what happens if I eliminate native regex execution completely. I like the idea a lot for a couple reasons:
- The resulting tool could then be ported in a fairly straightforward fashion between multiple languages.
- I get assurance that I'm traversing the input character by character, which allows me to create more features at the library level, e.g. methods that support bracket matching and escapes. Hopefully code written in such a way would be easier to read and use as it could come with documentation and examples, tests, and might be able to produce more detailed errors.
Some regex functionality will shift to library code, e.g. I'll need to be able to write CharClass('a-zA-Z0-9_-'). This will require some parsing code but otherwise does not seem problematic to me.
Sorry, we are using rd-parse extensively in production at ellx.io, and I just have no spare cycles at this moment to work on any breaking changes.
- I am using
STARTeven if it's not documented (I would be happier to accept a fix for the documentation instead of removing this functionality :)) - Removing the need for obligatory
^is good: I'd accept an improvement like this - Capture groups are absolutely crucial: I don't understand why you'd want to remove those.
In general, I understand why you want to go to a character-level scanner: it would indeed make the abstractions cleaner. However, regexes achieve 2 major things:
- most importantly, performance. I spent weeks profiling grammars in production, and you can easily check it yourself: even passing from
/^(?!foo|bar|baz)/toAny('foo', 'bar', 'baz')would currently slow down your parser by something like 20-30% on large files. You can imagine the hit you would take if you go to a single-character scanner, basically re-implementing the JS built-in RegEx engine usingrd-parse... You want your low-level scanner to eat as much of the input in one go as possible: it is a trade-off between readability/flexibility and performance. - Another thing is regexes are generally more concise for simple tokens, which makes your grammar more readable IMO
Currently, you do not HAVE to use regexes. If you still want to use a single character scanner, you already can: it should just be part of your grammar, not the parser generator, I believe.
That's ok. That's why I forked! I don't think it's a big deal for such a small codebase.
What are you using START for? I couldn't imagine any scenario in which it would be needed, but that's a lack of imagination clearly.
The perf thing makes sense. Good to know you've done some testing. I actually have implemented the JS regex engine now so I know how much slower that code can be. My current goal isn't speed exactly; I'm achieving performance through clever limitation of the amount of work done, specifically by using my regex engine to isolate only the small part of each file (a header comment) that I need to parse. I wonder if it might be possible to reclaim a lot of perf by shifting work to WebAssembly, which would become possible when native regex usage disappeared.
Ah here we go: it could be ported to AssemblyScript. I've never touched WASM so it'd be a cool learning experience for me.
Without knowing anything for sure, I believe it would still be fine to write the grammar in JS. You'd just be calling (JS) functions that return references to WASM functions instead of JS functions. Then the WASM functions would call each other. Actually closures probably won't work in WASM, but single-method class instances would presumably be fine.
Closing this for now...