datalust/superpower

Parsing confusion: Zero-width parsers, what rewinds and when, and properly returning a "failed" parse?

nitz opened this issue · 4 comments

nitz commented

Hello!

I want to preface this by saying that my knowledge of and skill with grammar structure is novice at best. On top of that, despite writing them, expression trees with LINQ queries still feel like magic; I don't comprehend the mechanisms that make them work, I merely accept that they do. That said, please feel free to beat me over the head with the "you're doing it wrong dummy, do it like this" hose! 🙂

Preface

I'm setting up a parser for a small declarative language, based on SDLang, but with my own twists and flair. I found the 'JsonParser' example to be incredibly thorough and at least similar enough to my task at hand to use as a guide when setting things up.

My first big head-scratcher was the order of initialization. It took some juggling in the parser section to figure it out, but the Builder bit of the tokenizer seemed to ensure I didn't have that issue there, I think.

Oh, worth noting: I'm on v3.0.0-dev-189 as I write this, but I noticed that NuGet apparently didn't grab the latest when I installed it earlier, so I will be updating after I post this.

Tokenizing

I felt like I grasped the concept of the tokenizing step fairly well, with two exceptions:

  • I'm still not sure how much or how little should actually be tokenized. E.g.: in my first implementation I went too far and tokenized things like the comment start //, which would have been cool, except I couldn't figure out how to continually consume tokens regardless of type, so I had to walk that back. I do intend to draft an ABNF grammar for my language at some point, which I hope will help me understand what is better to tokenize. Related to that, I couldn't determine what would make, say, the quotes on a string be part of the string token, while opting to let { braces be their own token.

  • The other, which leans into the latter half of my post title - I'm still having a hard time figuring out what backtracks, and when. Is it only Try()? I read through some of the code and it felt like Try() was setting the flag to backtrack. If that is in fact the case, I'm wondering if it's something about how I'm structuring things that makes me feel like I don't have nearly as many Try()s as I think I should (especially on the parsing side).

I got the tokenizer working pretty well; I was pleased with its chops when printing out the results. So other than my first point there, I think I'm decent on that front.

Value Converting

I'll be brief on this one: my TextParser<T>s mostly haven't been used yet. I have a handful of unit tests set up to make sure they handle the basics right, but this is more of a future-me problem.

Parsing The Tree

Using the SDL example on their homepage as a target for now, briefly, my document is intended to be structured like this: the Document is made up of 0 or more Elements. An Element is either a Node or a Comment. (Currently a Comment tokenizes as a single token. I may want to change that later, but it's a low-shelf feature.) Nodes are composed of zero or one Name, zero to N Properties in the form identifier=value, and zero to N subnodes in the form { /* same as node content */ }. The node properties are collectively referred to as Arguments. I have plans to get more complicated, but that's a good surface area for now.
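
For illustration, a document of that shape might look something like this (a made-up sample, not my final syntax):

    // a comment element
    author "Peter" email="peter@example.org" active=true {
        // subnodes have the same shape as node content
        bookmark 12
    }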

My parsing structure, as I understand how I've got it set up, looks like this (pseudocode):

Document
    - Elements (Many, AtEnd)
        - Node (Try)
            - Name (Try, Optional)
                - (Text Parser)
            - Argument (Try, Many)
                - Property (Try)
                    - (...parsers...)
                - Value (Try)
                    - (...parsers...)
                - Subnode (Try)
                    - Ref: Node
        - (Or) Comment

Hopefully that makes sense.

Anyways, I've gone through several different attempts at writing these first-level parsers. I ran into a few issues like stack overflows (which were surprisingly difficult to track down!) and a few other obvious mistakes, but I'm at the point now where I'm pretty stumped.

  • Parsing barely starts before I'm met with "Many() cannot be applied to zero-width parsers." I even went so far as to take out each Many() I was using just to try it, but there seemed to be an implicit one even when using AtLeastOnce(), so I haven't been able to shake it.

That does, however, make me think that my problem is further down the tree in the more granular cases, and I'm wondering if I'm not consuming tokens where I feel like I should be. The case I imagine is giving me the most trouble is building the node. I've rewritten it a number of ways, but none of them seem to help.

Rather than blowing up this ticket even more with snippets, I threw up a number of my bits in a gist, here:

Gist of some of my TokenParsers

Any of the things named Something_TryX are just alternates I've tried for those same parsers.

Hopefully that wasn't all too terribly much and someone can point me in the right direction!

Thanks for reading, and for the awesome library!

nblumhardt commented

Hi! Thanks for dropping by. SDLang looks like a nice little DSL; hopefully writing the parser will turn out to be fun :-)

I'm on pretty limited time to reply - I have a couple of points that may help (below), but I'll circle back when I can, so if you end up with more questions feel free to add them here (though replies could be slow, sorry).

Try and backtracking - it's a bit long in the tooth now, but reading this short series of blog posts by Brian McNamara should get you all the info you need; Superpower uses a slightly different mechanism under the hood, but the Try() combinator is essentially the one Brian describes.
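
A minimal sketch of how that plays out in Superpower (not your grammar, just an illustration):

    using Superpower;
    using Superpower.Parsers;

    // Without Try(), Or() won't attempt the second alternative when the
    // first one fails *after consuming input*.
    var keyword =
        Span.EqualTo("node").Try()       // rewind to the start on a partial match
            .Or(Span.EqualTo("note"));

    // Parsing "note": EqualTo("node") consumes "no", then fails on 't' vs 'd';
    // Try() rewinds those two characters so EqualTo("note") can succeed.
    var result = keyword.Parse("note");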

Zero-length matches - imagine you have a parser A that parses strings like aaaaa and returns the number of as in them. A.Parse("aaa") would be 3, while A.Parse("") would be 0. Now imagine applying the Many() combinator to it: A.Many(). The A parser will succeed on "", but it won't consume any input, so when Many() tries A again it will succeed again - and so on, indefinitely. Therein lies the road to madness... so to prevent this sticky situation (which can otherwise get very hard to detect), Many() disallows zero-length matches, and most Superpower parsers shouldn't allow them. What it means if you're hitting this: you have a parser that succeeds without consuming input - somewhere. Best to rewrite so that the parser fails on no match, and then use OptionalOrDefault() to handle the empty case.
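
To make that concrete, a quick sketch (hypothetical parsers, just for illustration):

    using Superpower;
    using Superpower.Parsers;

    // A succeeds on "" with zero 'a's - a zero-width match.
    TextParser<int> a = Character.EqualTo('a').Many().Select(cs => cs.Length);

    // a.Many() would spin forever on that zero-width success, so Superpower
    // rejects it with "Many() cannot be applied to zero-width parsers".

    // The fix: make the inner parser fail when there's nothing to match...
    TextParser<int> atLeastOneA =
        Character.EqualTo('a').AtLeastOnce().Select(cs => cs.Length);

    // ...and handle the empty case explicitly where it's needed.
    TextParser<int> maybeA = atLeastOneA.OptionalOrDefault(); // 0 when absent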

How to decide on the boundaries of tokens - in general, tokens should be as inclusive as the grammar allows. If //, when it appears in the grammar, can only mark the start of a comment, and you never "look inside" the comment, then the token should begin at // and continue to the end of the comment. If // can appear in situations where it's not marking a comment, or if your grammar needs to split the comment contents out from the starting delimiter, then treating // as one token and the comment body as another will be needed. There are really no hard-and-fast rules, though.
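
For example, with the TokenizerBuilder a whole // comment can be captured as a single token (the token names here are hypothetical):

    using Superpower;
    using Superpower.Parsers;
    using Superpower.Tokenizers;

    enum SdlToken { Comment, Identifier, LBrace, RBrace }

    // The comment token begins at // and runs to the end of the line; the
    // parser never needs to look inside it.
    static readonly Tokenizer<SdlToken> Tokenizer = new TokenizerBuilder<SdlToken>()
        .Ignore(Span.WhiteSpace)
        .Match(Comment.CPlusPlusStyle, SdlToken.Comment)
        .Match(Character.EqualTo('{'), SdlToken.LBrace)
        .Match(Character.EqualTo('}'), SdlToken.RBrace)
        .Match(Identifier.CStyle, SdlToken.Identifier)
        .Build();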

Hope this helps,
Nick

nitz commented

Nick,

Wow, thanks for the response, and so quickly - this certainly does clear a lot up!

Especially on the token boundaries. The "look inside" guideline seems to be something I was close to, but wasn't quite getting; seeing your explanation helps a ton with that. For instance, I'm planning not only the traditional C++-style // comments, which I'd never need to look inside and would treat as a single chunk, but also a /- comment style (one of the features I want to borrow from KDL, a fork of SDL). I still want to process all the text after that comment, but deserialize it in a specific way that says "this content exists, but isn't part of the regular data stream", so it makes sense for /- to be a token.

On zero-length matches, your response makes perfect sense, and I think it's definitely what I'm seeing given my input. My goal is exactly as you say: write it so it fails on no match; I'm just a bit confused about how to fail. I have definitely gotten too "permissive" in my approach, but I think that comes from not being exactly sure what will trigger a failure, and from assuming each of my rules will run its course even on bad input.

I tried selecting a TokenListParserResult.Empty<...>() at one point, but found it wanted a remainder to return with, and I couldn't figure out how to give it that. Is that the usual method for saying "I didn't parse anything"? Or should my rules be trying to consume what they expect, and that will properly trigger the fail cases?

Thanks again!

Chris

nitz commented

Also, well, you got me there!

A slight tweak to one of my latest Node parsers and a few removals of Try() where I shouldn't have been using it, and I'm off to the races solving problems with my text parsers!

Here's what cleared up the zero-length parses, if I'm understanding it right: requiring at least one argument when there's no identifier (meaning there should be no case where a node is returned with no name and no arguments!):

    private static TokenListParser<Token, Element> Node { get; } =
        (from element in Identifier
             .Then(name =>
                 // Named node: the name may be followed by zero or more arguments.
                 Argument.Many().OptionalOrDefault(Array.Empty<Element>())
                     .Named("arguments")
                     .Select(x => new Node(name, x) as Element))
             .Named("node name")
             .Or(
                 // Anonymous node: with no name, at least one argument is
                 // required, so this branch never succeeds without consuming input.
                 Argument.AtLeastOnce()
                     .Named("arguments")
                     .Select(x => new Node(string.Empty, x) as Element))
         select element)
        .Named(nameof(Node));
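
(If I'm reading it right, both branches now consume at least one token, either the identifier or an argument, so the Many()/AtEnd at the Document level always advances.)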

You're fine to close this one if you want; I'm only leaving it open for the moment in case you have anything to add about another way to force a failure. 🙂

nblumhardt commented

Hope it's going well! Will close this to keep the clutter under control in the issue tracker :-)