lezer-parser/lezer

Grammar railroad diagram

mingodad opened this issue · 6 comments

Manually converting the lezer grammar to an EBNF understood by https://www.bottlecaps.de/rr/ui (the conversion has some deviations that can be improved/fixed) gives us a nice railroad diagram (https://en.wikipedia.org/wiki/Syntax_diagram).

Ideally the lezer generator would have an option to output EBNF grammar for debugging and for manipulation with other tools, as I did at https://github.com/mingodad/lalr-parser-test for bison/byacc/lemon.

Copy and paste the EBNF shown below into https://www.bottlecaps.de/rr/ui on the Edit Grammar tab, then click the View Diagram tab.

declaration ::=
	RuleDeclaration+
	| PrecedenceDeclaration
	| TokensDeclaration
	| ExternalTokensDeclaration
	| ExternalPropDeclaration
	| ExternalSpecializeDeclaration
	| ContextDeclaration
	| DialectsDeclaration
	| TopSkipDeclaration
	| SkipScope
	| DetectDelimDeclaration

RuleDeclaration ::=
	"@top" RuleName Props? ParamList? Body

PrecedenceDeclaration ::=
	"@precedence" PrecedenceBody

PrecedenceBody ::=
	"{" (Precedence  ","?)* "}"

Precedence ::=
	PrecedenceName ("@left" | "@right" | "@cut")?

TokensDeclaration ::=
	"@tokens" TokensBody

TokensBody ::=
	"{" tokenDeclaration* "}"

ExternalTokensDeclaration ::=
	"@external" "tokens" Name "from" Literal externalTokenSet

ExternalPropDeclaration ::=
	"@external" "prop" Name ("as" Name)? "from" Literal

ExternalSpecializeDeclaration ::=
	"@external" ("extend" | "specialize") Body Name "from" Literal externalTokenSet

ContextDeclaration ::=
	"@context" Name "from" Literal

DialectsDeclaration ::=
	"@dialects" DialectBody

DialectBody ::=
	"{" (Name ","?)* "}"

TopSkipDeclaration ::=
	"@skip" Body

SkipScope ::=
	"@skip" Body /*!scopedSkip*/ SkipBody

SkipBody ::=
	"{" RuleDeclaration* "}"

DetectDelimDeclaration ::= "@detectDelim"

externalTokenSet ::=
	"{" (Token ","?)* "}"

Token ::=
	RuleName Props?

tokenDeclaration ::=
	TokenPrecedenceDeclaration
	| TokenConflictDeclaration
	| LiteralTokenDeclaration
	| RuleDeclaration

TokenPrecedenceDeclaration ::=
	"@precedence" PrecedenceBody

PrecedenceBody ::=
	"{" ((Literal | nameExpression) ","?)* "}"

TokenConflictDeclaration ::=
	"@conflict" ConflictBody

ConflictBody ::=
	"{" (Literal | nameExpression) ","? (Literal | nameExpression) "}"

LiteralTokenDeclaration ::=
	Literal Props?

RuleDeclaration ::=
	RuleName Props? ParamList? Body

ParamList ::=
	"<" (Name ("," Name)*)? ">"

Body ::=
	"{" expression? "}"

Props ::=
	"[" ((Prop ",")* Prop)? "]"

Prop ::=
	(AtName | Name) ("=" (Literal | Name | "." | PropEsc)*)?

PropEsc ::=
	"{" RuleName "}"

expression ::=
	seqExpression
	| Choice

Choice ::=
	seqExpression? ("|" seqExpression?)+

seqExpression ::=
	atomExpression
	| Sequence

Sequence ::=
	marker (atomExpression | marker)*
	| atomExpression (atomExpression | marker)+

atomExpression ::=
	Literal
	| CharSet
	| AnyChar
	| InvertedCharSet
	| nameExpression
	| Optional
	| Repeat
	| Repeat1
	| InlineRule
	| ParenExpression
	| Specialization

Optional ::=
	atomExpression /*!repeat*/ "?"

Repeat ::=
	atomExpression /*!repeat*/ "*"

Repeat1 ::=
	atomExpression /*!repeat*/ "+"

InlineRule ::=
	(RuleName /*!inline*/ Props? | Props) Body

ParenExpression ::=
	"(" expression? ")"

Specialization ::=
	("@specialize" | "@extend") Props? ArgList

nameExpression ::=
	RuleName
	| ScopedName
	| Call

Call ::=
	(RuleName | ScopedName) /*!call*/ ArgList

marker ::=
	PrecedenceMarker
	| AmbiguityMarker

PrecedenceMarker ::=
	"!" PrecedenceName

AmbiguityMarker ::=
	"~" Name

ScopedName ::=
	RuleName /*!namespace*/ "." RuleName

ArgList ::=
	"<" (expression ("," expression)*)? ">"


RuleName ::=
	name

PrecedenceName ::=
	name

Name ::=
	name


//@tokens ::=
whitespace ::= std.whitespace+
LineComment ::= "//" [^\n]*
BlockComment ::= "/*" blockCommentRest
blockCommentRest ::= [^*] blockCommentRest | "*" blockCommentAfterStar
blockCommentAfterStar ::= "/" | "*" blockCommentAfterStar | [^/*] blockCommentRest
name ::= (std.asciiLetter | std.digit | $[\-_\u{a1}-\u{10ffff}])+
AnyChar ::= "_"
//@precedence { AnyChar, whitespace, name }
keyword ::= name
//@precedence { whitespace, keyword }
AtName ::= "@" name
Literal ::=
	'"' ([^\\\n"] | "\\" _)* '"'?
	| "'" ([^\\\n'] | "\\" _)* "'"?

CharSet ::= "$[" ([^\\#x1D] | "\\" '_')* "]"
InvertedCharSet ::= "![" ([^\\#x1D] | "\\" '_')* "]"
/*
@precedence { InvertedCharSet, "!" }
  "{" "}" "(" ")" "[" "]"
  "=" "." "|" "!" "~" "*" "+" "?"
*/

//@detectDelim

> Ideally the lezer generator would have an option to output EBNF grammar

I'm okay with adding an option to emit the parse tree of the grammar file as JSON, to make it easy to build tools like this externally, but I don't want to increase the scope of the tool by adding something like this.

It's definitely an interesting option.

While trying to create an LL(1) parser for the lezer grammar (https://github.com/lezer-parser/lezer-grammar/blob/main/src/lezer.grammar), I found that there are some embedded rules with the same name but distinct bodies (like PrecedenceBody, shown below). Is this expected?

PrecedenceDeclaration {
  at<"@precedence"> PrecedenceBody {
    "{" (Precedence { PrecedenceName (at<"@left"> | at<"@right"> | at<"@cut">)? } ","?)* "}"
  }
}
...
TokenPrecedenceDeclaration {
  at<"@precedence"> PrecedenceBody { "{" ((Literal | nameExpression) ","?)* "}" }
}

Yes, that's expected.
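For a tool that treats every rule name as a single global nonterminal, such as a hand-written LL(1) parser working from a flattened EBNF, these same-named inline rules will likely need renaming first. The following is a minimal Python sketch, not part of lezer, that lists names defined with more than one distinct body. It runs on an excerpt of the EBNF from the first post; the helper names `rule_bodies` and `conflicting_rules` are made up for this illustration.

```python
import re
from collections import defaultdict

def rule_bodies(ebnf):
    """Map each rule name to the list of bodies defined for it, in order."""
    bodies = defaultdict(list)
    # Splitting on "Name ::=" headers yields [preamble, name1, body1, name2, body2, ...].
    parts = re.split(r"^\s*([A-Za-z_]\w*)\s*::=", ebnf, flags=re.MULTILINE)
    for name, body in zip(parts[1::2], parts[2::2]):
        # Collapse whitespace so that layout differences do not count as distinct bodies.
        bodies[name].append(" ".join(body.split()))
    return bodies

def conflicting_rules(ebnf):
    """Return only the names that are defined with more than one distinct body."""
    return {n: b for n, b in rule_bodies(ebnf).items() if len(set(b)) > 1}

# Excerpt of the hand-converted EBNF posted at the top of this issue.
EBNF = '''
PrecedenceDeclaration ::=
	"@precedence" PrecedenceBody

PrecedenceBody ::=
	"{" (Precedence  ","?)* "}"

TokenPrecedenceDeclaration ::=
	"@precedence" PrecedenceBody

PrecedenceBody ::=
	"{" ((Literal | nameExpression) ","?)* "}"
'''

for name, bodies in conflicting_rules(EBNF).items():
    print(name, "has", len(bodies), "distinct bodies")
# prints: PrecedenceBody has 2 distinct bodies
```

One possible fix in a flattened grammar is to rename the second definition (for example to a hypothetical `TokenPrecedenceBody`) and update the rule that refers to it.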

Could you tell me which rule matches this line of your JavaScript grammar:

ExportGroup {
  "{" commaSep<VariableName (ckw<"as"> (VariableName { word } | String))?> "}" ///!!! <<< here after VariableName
}

Looking through the lezer grammar, it seems that the rule that could match the line shown above is one of these:

...
ParamList { "<" (Name ("," Name)*)? ">" }
...
ArgList {
  "<" (expression ("," expression)*)? ">"
}
...

I'd say ArgList, but in both of them there should be a comma (,) separating more than one element, and I don't see one.

As I said before, I'm trying to create an LL(1) parser for the lezer grammar, and my parser got stuck here, at commaSep<VariableName, waiting for a comma (,).

Am I missing something here?

Can a parser generated from the lezer grammar parse the JavaScript grammar?

ParamList is used for parameterized rule definitions, ArgList for argument lists to rules. So I guess ArgList in this case. Multiple arguments are separated by commas, but a sequence of expressions has its usual meaning as a single argument.

And yes, @lezer/lezer parses the other parsers as part of its test suite.
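The reading described above (commas separate arguments, while juxtaposed expressions form a single argument) can be sketched as a small recursive-descent parser. This is a simplified Python illustration that covers only a subset of the lezer syntax (names, double-quoted literals, calls, inline rules, parentheses, choice, and the `?`, `*`, `+` suffixes). The `tokenize` function, the `Parser` class, and the tuple-shaped tree are assumptions made for this example, not part of @lezer/lezer.

```python
import re

# Simplified token shapes: double-quoted literal, name, or single punctuation character.
TOKEN = re.compile(r'\s*(?:("[^"\n]*")|([A-Za-z_][\w-]*)|([<>(){}|,?*+]))')

def tokenize(src):
    src, pos, out = src.rstrip(), 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {src[pos]!r}")
        out.append(m.group(1) or m.group(2) or m.group(3))
        pos = m.end()
    return out

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def expression(self):
        # expression ::= sequence ("|" sequence)*
        alts = [self.sequence()]
        while self.peek() == "|":
            self.eat("|")
            alts.append(self.sequence())
        return alts[0] if len(alts) == 1 else ("choice", alts)

    def sequence(self):
        # Juxtaposed atoms form ONE sequence; a comma never ends up inside it.
        items = []
        while self.peek() not in (None, "|", ",", ">", ")", "}"):
            items.append(self.atom())
        return items[0] if len(items) == 1 else ("seq", items)

    def atom(self):
        tok = self.eat()
        if tok == "(":
            node = self.expression()
            self.eat(")")
        elif tok.startswith('"'):
            node = ("lit", tok)
        elif re.match(r"[A-Za-z_]", tok):
            node = ("name", tok)
            if self.peek() == "<":        # call with an ArgList
                node = ("call", tok, self.arg_list())
            elif self.peek() == "{":      # inline rule, e.g. VariableName { word }
                self.eat("{")
                node = ("inline", tok, self.expression())
                self.eat("}")
        else:
            raise SyntaxError(f"unexpected token {tok!r}")
        while self.peek() in ("?", "*", "+"):
            node = (self.eat(), node)
        return node

    def arg_list(self):
        # ArgList ::= "<" (expression ("," expression)*)? ">"
        self.eat("<")
        args = []
        if self.peek() != ">":
            args.append(self.expression())
            while self.peek() == ",":
                self.eat(",")
                args.append(self.expression())
        self.eat(">")
        return args

line = 'commaSep<VariableName (ckw<"as"> (VariableName { word } | String))?>'
call = Parser(tokenize(line)).expression()
print(len(call[2]))  # number of arguments passed to commaSep; prints 1
```

In this sketch the parser only looks for a comma after a complete expression has been read, so the ExportGroup line parses as a call to commaSep with one argument that happens to be a two-element sequence.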