jeffreykegler/libmarpa

A way to reject partial parses

dabrahams opened this issue · 22 comments

Some rules in my grammar have constraints that there should be no whitespace between adjacent elements (some also have the constraint that there should be no newlines, but we can ignore that for now). Is there a good way to handle that sort of thing with Marpa? If there were a way to reject the completion of certain rules during parsing, that would be simple.

The first alternative I see is to explicitly represent the possibility of whitespace in the grammar, but to do that without introducing huge ambiguity seems to mean I have to do the Aycock & Horspool nihilist normal form (NNF) transformation on my raw grammar, so I don't introduce whitespace recognition on both sides of a nulling symbol. Since Marpa is surely already doing that analysis, it feels wasteful to implement it myself.

The second alternative is to ignore the whitespace constraints until the parse forest is complete and then reject any complete parses that violate the constraints, but that seems inefficient, since many parses may be explored that could have been rejected earlier, and there's no way, once a constraint violation is detected in a complete parse, to avoid checking other complete parses that contain the same partial parse that violated the constraint.

In the lexer you could define special whitespace-preceded lexemes. When these are recognized they can be treated as you wish, for example as an error. Marpa::R2 has "lexeme events", and that kind of mechanism might be helpful.

I noticed the attempt at explicit representation of whitespace in the grammar for issue #116. This could also be a solution, and may not need to be ambiguous. I experimented with this, and my experience suggested that

  1. all the rules should deal only with inter-symbol whitespace, delegating initial and final whitespace to their parent rules, and
  2. there should be a single top rule dealing with any initial and/or final whitespace allowed for the full parse.

I gave up on this in favor of the "discard" lexemes implemented for Marpa::R2, because getting it all right was like being pecked to death by ducks. Nonetheless, such a scheme does lend itself to the kind of locally non-orthogonal treatments of whitespace you have in mind.

For context, the EBNF source of my grammar (which I'm converting mechanically into MARPA rules) is in the ebnf markdown blocks here—search for the string "(no-" to find the special rules that don't implicitly allow whitespace.

Regarding the whitespace-preceded lexemes idea:

  1. I don't understand “when these are recognized they can be treated as you wish, for example as an error.” I certainly don't want to cause an error when they are recognized by the lexical analyzer. Where and how do you propose I treat them as an error?

  2. In most places I need to accept the lexemes whether or not they're preceded by whitespace, so I'd end up defining nonterminals like

     colon -> ws-colon
     colon -> no-ws-colon
    

    and using “colon” almost everywhere. Not the end of the world, but not great either.

  3. If you look at the places where these constraints are expressed, it's not in terms of lexemes at all, and it's not in terms of what-precedes, but what's-between. Automatically rewriting the rules in terms of these ws-prefixed lexemes is as much work as the NNF transformation I'd have to do to avoid ambiguity with my current strategy, and I've already coded the NNF transformation once—I don't even know what the rewrite for your suggestion would look like.

With regard to explicit representation of whitespace: no, it needn't be ambiguous, but the naïve approach I currently have, interjecting optional space tokens between every pair of adjacent symbols in the rules, does in fact cause a lot of ambiguity. Any nullable symbol surrounded by a pair of "any amount of whitespace" (AAoW) symbols will cause whitespace in that position to be parsed ambiguously, as part of either the first AAoW symbol or the second. That's why I would need the NNF transformation: to avoid adding these AAoW symbols around nullable symbols.
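To make the problem concrete, here is a rough Python sketch (the symbol names are invented, and this illustrates the idea rather than the transformation I would actually run) of why naive AAoW insertion around a nullable symbol is ambiguous, and of an NNF-style split that avoids it:

    # Sketch: a rule's right-hand side is a list of symbol names; "b" is nullable.
    # Naive insertion of an "any amount of whitespace" (AAoW) symbol between every
    # pair of adjacent symbols turns  s -> a b c  into  s -> a AAoW b AAoW c.
    # When b derives the empty string, a run of whitespace between a and c can be
    # split between the two AAoW instances in many ways, hence the ambiguity.
    #
    # An NNF-style fix: split each rule on whether its nullable symbols are nulled,
    # and only then insert AAoW between the symbols that remain.  (In a full NNF
    # rewrite the kept "b" would become a distinct, non-nulling variant of b.)

    def split_on_nullables(rhs, nullable):
        """Yield every variant of rhs in which each nullable symbol is either
        kept (as its non-nulling variant) or dropped entirely."""
        if not rhs:
            yield []
            return
        head, rest = rhs[0], rhs[1:]
        for tail in split_on_nullables(rest, nullable):
            yield [head] + tail          # head is present
            if head in nullable:
                yield tail               # head is nulled, so drop it

    def insert_medial_ws(rhs, ws="AAoW"):
        """Insert a whitespace symbol between adjacent symbols only."""
        out = []
        for i, sym in enumerate(rhs):
            if i:
                out.append(ws)
            out.append(sym)
        return out

    for variant in split_on_nullables(["a", "b", "c"], nullable={"b"}):
        print(insert_medial_ws(variant))
    # ['a', 'AAoW', 'b', 'AAoW', 'c']   b present and non-null: AAoW on both sides is fine
    # ['a', 'AAoW', 'c']                b nulled: only one AAoW, so no double split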

I don't think I understand what your experience suggested. Is there an automated transformation for it that I can apply to my specification?

I did my work on explicit whitespace as research, possibly toward an automated transformation, but I abandoned the idea in favor of the "discard" lexeme approach.

My idea was, for those lexemes which do not allow preceding whitespace, to define "error lexemes", which consist of that lexeme preceded by whitespace. If you see one of these, you know someone has put whitespace in the wrong spot, and you can handle it accordingly. You only need to define such "error lexemes" for the cases where whitespace is, indeed, an error.

In correct input, you won't see any "error lexemes".
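As a rough illustration (the token names and patterns below are invented for the example, not taken from the Val grammar), a lexer can recognize such an error lexeme simply by matching the offending whitespace together with the lexeme that must not follow it; raising an error here is only one way to handle it, and the error lexeme could just as well be passed through for the parser or application to deal with:

    import re

    # Sketch: "colon" stands in for any lexeme that must never be preceded by
    # whitespace.  The "error lexeme" is just whitespace followed by that lexeme,
    # tried before the ordinary alternatives.
    TOKEN_SPEC = [
        ("ERROR_COLON", r"[ \t]+:"),     # whitespace immediately before ':'
        ("COLON",       r":"),
        ("WS",          r"[ \t]+"),      # ordinary whitespace, discarded below
        ("IDENT",       r"[A-Za-z_]\w*"),
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def lex(text):
        for m in MASTER.finditer(text):
            kind = m.lastgroup
            if kind == "WS":
                continue
            if kind == "ERROR_COLON":
                raise SyntaxError(f"whitespace before ':' at offset {m.start()}")
            yield kind, m.group()

    print(list(lex("a:b")))              # [('IDENT', 'a'), ('COLON', ':'), ('IDENT', 'b')]
    try:
        list(lex("a :b"))
    except SyntaxError as err:
        print("rejected:", err)          # rejected: whitespace before ':' at offset 1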

Sure… but I don't have any lexemes that are defined to not allow preceding whitespace, and it's not clear to me that the grammar I am encoding can be represented in those terms.

Re "I don't have any lexemes that are defined to not allow preceding whitespace" in #117 (comment), my comments were attempting to be responsive to this from #117 (comment):

Some rules in my grammar have constraints that there should be no whitespace between adjacent elements

I understand that. What I'm saying is that the whitespace restrictions in my grammar rules are expressed in terms of the LHS nonterminal being recognized, not in terms of the participating lexemes. You could of course rewrite the grammar to attach those restrictions to lexemes, but that would be a massive rewrite that I don't know how to do mechanically.

At this point it is starting to look like I should switch to my own implementation of the algorithms, because I can easily add the necessary functionality.

Any objection to closing this issue?

🤷 I don't think my problem's been addressed; in case it wasn't clear, I think the fact that I don't have a solution means I'm going to stop using MARPA. The only reason I can see for leaving the issue open would be if you intended to do something about it, but it sounds as though you don't. If I've understood correctly, closing it would be appropriate.

Tomorrow I'll look this issue over and try to see what I have missed. Is there a write-up somewhere of your requirement or something similar?

The decision about which parser to use belongs to the user, and I try to take from these choices what I can for Marpa, and otherwise to respect the user's choice and not take it personally.

In case I don't get another opportunity, I want to thank you for all these issues. They include insightful editing suggestions and the detection of an obscure but serious bug. It will take me weeks to work through all of them. The API doc will be significantly better, and Marpa will be significantly improved, thanks to the time you've spent here.

You're welcome, and thank you for the work you've done here. If it turns out I switch implementations I hope I can keep asking you questions? If so, what's the best medium?

It's best to communicate via a GitHub issue when it concerns a specific issue on a repo, and via the IRC channel otherwise.

One reason I may have seemed reticent is that I am avoiding the issue of "futures". I do have my own ideas for new directions, which may include things like undoing bits of parsing already done. None of these are things you'll see in the next few weeks.

I mean questions regarding what MARPA does internally, and why, rather than anything about its future direction.
I'd particularly like an answer to the first question in this comment.

I took a closer look at the Val doc. What does the (no-whitespace) signify in the following?

module-definition ::= (no-whitespace)
  whitespace-opt module-body whitespace-opt

It's an EBNF notation I have not seen before. Naively, it seems to say "no whitespace, but yes optional whitespace at the beginning".

Is it a meta-notation saying that the rule is not tokenized, but that the lexical whitespace is not to be taken literally either; instead it is indicated via symbols like whitespace-opt?

@dabrahams OK. I looked over the Val doc.

Re the "go ambiguous then prune" approaches. I've done some work in creating tools for these, but the IF is not documented in Libmarpa, although it is in Marpa::R2 (as the ASF interface) for Perl users. I'm not enthusiastic about its chances because my work with general parsing has given me an enormous respect for exponential growth --- creating an ambiguous ASF, then pruning, is likely to be expensive. I'll say more if you want, but here I'll move on to an approach that does not exploit ambiguity.

Reading the Val doc, I see its EBNF as having two kinds of rules -- tokenized and explicit (the (no-whitespace) rules). That EBNF could be translated for Marpa by making every rule "explicit". This might be done automatically by inserting optional whitespace between every pair of adjacent symbols in the tokenized rules; a rough sketch follows the notes below. Some notes:

1.) To reduce ambiguity, non-medial (initial and final) whitespace should be avoided in the explicit rules. I'd be tempted to outright ban non-medial whitespace in explicit rules, but introduce a (raw) meta-notation to override the ban for those users who believe that they know what they are doing.

2.) The top rule obviously requires some sort of special treatment, because initial and final whitespace is necessary to express many useful grammars, and allowing it at the top level introduces no ambiguity.

3.) Then there is the "weeknights" problem: how to avoid the ambiguity of reading it as the two identifiers "wee knights"? The longest-acceptable-token method, which always goes for "weeknights", is the naive approach; it is native to Marpa::R2 and will cover almost all cases. But there may be corner cases, and useful apps for which hacks are needed.
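Here is the promised sketch of that translation, in Python with invented rule structures and symbol names (it illustrates the scheme and is not code from Marpa or from the Val project). Tokenized rules get an optional-whitespace symbol inserted between adjacent symbols only, explicit rules are left untouched, and one wrapper top rule allows initial and final whitespace for the whole input:

    # Sketch: a rule is (lhs, rhs, explicit), where rhs is a list of symbol names
    # and explicit=True marks a "(no-whitespace)" rule that already spells out any
    # whitespace it allows via symbols of its own.
    WS_OPT = "ws-opt"   # "zero or more whitespace" symbol, assumed defined elsewhere

    def translate(rules, start):
        out = []
        for lhs, rhs, explicit in rules:
            if explicit:
                out.append((lhs, rhs))            # leave explicit rules untouched
            else:
                medial = []
                for i, sym in enumerate(rhs):
                    if i:
                        medial.append(WS_OPT)     # between adjacent symbols only,
                    medial.append(sym)            # never initial or final (note 1)
                out.append((lhs, medial))
        # One wrapper top rule handles initial and final whitespace (note 2).
        out.append(("top", [WS_OPT, start, WS_OPT]))
        return out

    rules = [
        ("binding", ["identifier", "colon", "type-expr"], False),       # tokenized
        ("tight-pair", ["identifier", "dot", "identifier"], True),      # explicit
    ]
    for lhs, rhs in translate(rules, "binding"):
        print(lhs, "::=", " ".join(rhs))
    # binding ::= identifier ws-opt colon ws-opt type-expr
    # tight-pair ::= identifier dot identifier
    # top ::= ws-opt binding ws-opt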

This is like the approach I was investigating when I decided to go with the "discard symbols" alternative. I did this because:

1.) I thought the "discard" approach would be easier to explain. I didn't have the idea of mixing explicit and tokenized rules, which does make things more natural on the notation side.

2.) I thought working out all the corner cases, and providing workarounds, might be quite the project, possibly aggravating 1).

3.) I was nearly 100% sure the discard approach would work -- it was very closely related to what lexers had been doing for decades. The explicit whitespace approach had fewer parallels in practical experience. Those who pioneer unsuccessful approaches in a sense contribute as much as those who try the successful approaches, but I was not eager to be one of the former.

I hope you find this useful.

ASF = Abstract Syntax Forest. This is an alternative evaluation interface, developed under Marpa::R2, which uses the bocage to traverse the forest of parses directly. The standard valuator allows parses to be iterated one by one, but if you want to deal with all of an exponential number of them, that is often not adequate. The ASF allows you to prune the forest, eliminating potentially huge numbers of possibilities with each step.

To implement its ASF, Marpa::R2 violates the Libmarpa API, using Libmarpa calls which are currently not documented.
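Setting the Perl interface aside, the idea of forest pruning can be sketched abstractly. The following Python sketch is not the ASF API; the node structure and the whitespace predicate are invented purely to illustrate why pruning one alternative can reject a whole family of parses at once:

    # Sketch: an or-node has a symbol and a list of alternatives; each alternative
    # is a list of child nodes, and a terminal is a node with no alternatives.
    class Node:
        def __init__(self, symbol, alternatives=None, text=""):
            self.symbol = symbol
            self.alternatives = alternatives or []
            self.text = text

    def prune(node, reject):
        """Drop every alternative that `reject` dislikes, bottom-up.  Returns None
        if no alternative survives; every parse tree that contained a dropped
        alternative disappears in that single step."""
        if not node.alternatives:                 # terminal nodes always survive
            return node
        kept = []
        for alt in node.alternatives:
            children = [prune(child, reject) for child in alt]
            if all(c is not None for c in children) and not reject(node, children):
                kept.append(children)
        node.alternatives = kept
        return node if kept else None

    # Example predicate: under a "no-ws" node, any alternative containing a
    # whitespace child violates the constraint and is pruned.
    def no_ws_reject(node, children):
        return node.symbol.startswith("no-ws") and any(c.symbol == "ws" for c in children)

    a, b, ws = Node("ident", text="a"), Node("ident", text="b"), Node("ws", text=" ")
    pair = Node("no-ws-pair", [[a, b], [a, ws, b]])   # two readings of the same span
    prune(pair, no_ws_reject)
    print(len(pair.alternatives))                     # 1: the whitespace reading is gone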

This discussion has meandered quite a bit, and has since gone off-line. @dabrahams: Do you have any objection to my closing this?

none

Absent feedback to the contrary, and as per #117 (comment), I will close this issue after a couple of days.