dryruby/ebnf

invalid range syntax errors in some circumstances

Closed this issue · 6 comments

Description
I get SyntaxError in some rules as simple as letter ::= [A-Z]. It seems to be caused by another rule being just after it, but it does not seem that simple. I attach a program to show the cases I checked.
The error produced is like this:

$ irb --nocolorize --simple-prompt -r ebnf
>> EBNF.parse("letter ::= [A]\nx ::= 'X'")
Traceback (most recent call last):
       10: from /usr/local/bin/irb:23:in `<main>'
        9: from /usr/local/bin/irb:23:in `load'
        8: from /var/lib/gems/2.7.0/gems/irb-1.2.7/exe/irb:11:in `<top (required)>'
        7: from (irb):1
        6: from /var/lib/gems/2.7.0/gems/ebnf-2.1.1/lib/ebnf.rb:26:in `parse'
        5: from /var/lib/gems/2.7.0/gems/ebnf-2.1.1/lib/ebnf.rb:26:in `new'
        4: from /var/lib/gems/2.7.0/gems/ebnf-2.1.1/lib/ebnf/base.rb:124:in `initialize'
        3: from /var/lib/gems/2.7.0/gems/ebnf-2.1.1/lib/ebnf/base.rb:124:in `new'
        2: from /var/lib/gems/2.7.0/gems/ebnf-2.1.1/lib/ebnf/parser.rb:265:in `initialize'
        1: from /var/lib/gems/2.7.0/gems/ebnf-2.1.1/lib/ebnf/parser.rb:302:in `rescue in initialize'
SyntaxError (ERROR [line: 1] syntax error, expecting :HEX, :SYMBOL, :O_RANGE, :RANGE, :STRING1, :STRING2, "(", :primary, :postfix, :seq, :alt, :expression (found "[A]\nx ::= 'X'"))
>> 

To reproduce
Run the attached ebnf_debug.rb.txt to see examples and counterexamples.
§ denotes newline as you can check in the program.
For me it prints:

$ ruby ebnf_debug.rb 
Ruby: 2.7.0,  ebnf: 2.1.1
letter ::= [A-Z]                          : OK
letter ::= [A]                            : OK
digit ::= [0-9]                           : OK
letter ::= [A-Z] §                        : OK
letter ::= [A-Y] | 'Z'                    : OK
word ::= [A-Z]+                           : OK
letter ::= [A-Z] /* x */                  : OK
letter ::= [A-Z]       § digit ::= [0-9]  : SyntaxError
digit ::= [0-9]        § letter ::= [A-Z] : SyntaxError
letter ::= [A-Z]       § word ::= letter+ : SyntaxError
letter ::= [A-Z]       § x ::= 'X'        : SyntaxError
letter ::= [A]         § x ::= 'X'        : SyntaxError
letter ::= [A] | 'B'   § x ::= 'X'        : OK
letter ::= 'B' | [A]   § x ::= 'X'        : SyntaxError
letter ::= [A] | [B]   § x ::= 'X'        : SyntaxError
letter ::= [A] x       § x ::= 'X'        : OK
letter ::= [A-Z] x     § x ::= 'X'        : OK
letter ::= x [A]       § x ::= 'X'        : OK
letter ::= x [A-Z]     § x ::= 'X'        : SyntaxError
letter ::= [A] /* x */ § x ::= 'X'        : OK
letter ::= [A-Z] /* */ § x ::= 'X'        : OK
letter ::= [^A]        § x ::= 'X'        : OK
                                          : OK

Expected behavior
A successful parse of each above indicated example.

Desktop (please complete the following information):

  • OS: Ubuntu 20.04.1 LTS
  • Browser: (?) not applicable

This comes from a conflict in the EBNF grammar between LHS (of rule) and RANGE.

[11] LHS        ::= ('[' SYMBOL ']' ' '+)? SYMBOL ' '* '::='
[14] RANGE      ::= '[' ((R_CHAR '-' R_CHAR) | (HEX '-' HEX) | R_CHAR | HEX)+ '-'? ']' - LHS

In spit of the - LHS it can still get confused.

Many (most) EBNF grammars use the [1] identifier before the rule name, but it is not officially part of the grammar, so the LHS terminal tries to accommodate this.

I think the fix is to branch the grammar depending on if it starts with '[' SYMBOL ']' or not, for example:

[1] ebnf ::= ((rulei (rulei | declaration)*) | (rules (rules | declaration)*))?

[3i] rulei ::= IDENTIFIER SYMBOL '::='

[3s] rules ::= SYMBOL '::='

That way, it should avoid the ambiguity.

As a workaround, before I get around to an update, you can try adding comments after the rule, or introducing the [xx] identifier prefix before the rule. You might also have some success using --input-format native, which uses a custom parser that has it's own issues, but may work for your particular grammar.

OK, thanks, I get it... I am new to this whole topic, and I was only looking at https://www.w3.org/TR/REC-xml/#sec-notation which does not mention the identifier. Maybe that is a point to clarify in the documentation?

And it seems that in the format recognized by this gem, all whitespace are created equal (is it preprocessed?), and this makes a newline acceptable also within the LHS.

I think another pretty unobtrusive way to remove this ambiguity (aside from your suggestion) could be requiring a line break before the LHS but not allowing a line break within the LHS - the latter seems already the case based on your comment, so I assume it is only applied after whitespace preprocessing. This way the rule identifier would not need to be uniformly used or unused, one could decide on a rule by rule basis.

The XML spec is the only citable reference for W3C EBNF, but many specs have taken liberties with the syntax. I document what I believe is colloquial EBNF in the README, this includes the use of [xxx] to identify productions by number, but also the merger of RANGE and ENUM, as is common, along with the @pass and @terminals declarations. I had a mind to create a CG to try to lead to a normative description of W3C EBNF, but haven't gotten around to it.

From the XML spec, whitespace is discarded between non-terminal productions, and newlines have no particular meaning. Within a terminal, you can include whitespace, and a terminal defined as an expression of other terminals has no implicit whitespace. In the existing grammar, LHS is a terminal, and doesn't allow whitespace, but it still leads to the ambiguities.

The update I'm working on separates ID from LHS, and would allow whitespace (including newline) between them, but it would branch depending on if the first non-terminal encountered is ID ([xxx]) or not. But, updating the grammar in the gem is non-trivial.

I did work on this previously, but wasn't able to come up with a consistent grammar which was unambiguous; going into the details of that would be challenging to reconstruct now, but if you turn on debugging, you can sort of figure it out.

As I said, in the mean time, there are some workarounds you might try.

My mistake, now I remember skimming through that part of the README, but later kept reading the W3C page instead.
Now that the cause is clear, the workarounds are clear and easy too 8-)

Documented, as there is no good way to make an unambiguous grammar having both RANGE and rule identifiers, but it can be worked around.