html5lib/html5lib-tests

Better document the grammar of the test format

gsnedders opened this issue · 1 comments

We sort of document it as a tree-construction-specific format. We should instead give the generic definition that most people actually use.

It's something like the state machine below:

HEADER: "#([^\n]+)"
BODY: "([^#][^\n]*)"
LF: "\n"

start:
  HEADER -> after_header

after_header:
  LF -> body
  EOF

body:
  LF -> after_lf
  BODY -> body
  EOF

after_lf:
  LF -> after_lf_lf
  BODY -> body {add "\n" to body}
  EOF

after_lf_lf:
  LF -> after_lf_lf {add "\n" to body}
  BODY -> body {add "\n\n" to body}
  HEADER -> after_header
  EOF {add "\n" to body}
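
To make the transitions concrete, here is a minimal Python sketch of the state machine above. It is not taken from any existing harness: `parse_sections`, `TOKEN_RE`, and `TRANSITIONS` are hypothetical names. It assumes that a single pending newline is written into the body when a body line follows it (the BODY transition out of `after_lf`), that every HEADER opens a new (header, body) pair, and that any transition not listed is an error.

```python
import re

# Token patterns as given in the issue. LF is tried before BODY because
# "[^#]" on its own would also match a newline.
TOKEN_RE = re.compile(r"(?P<HEADER>#[^\n]+)|(?P<LF>\n)|(?P<BODY>[^#][^\n]*)")

# (state, token kind) -> (next state, text added to the body on that transition)
TRANSITIONS = {
    ("start", "HEADER"): ("after_header", ""),
    ("after_header", "LF"): ("body", ""),
    ("body", "LF"): ("after_lf", ""),
    ("body", "BODY"): ("body", ""),
    ("after_lf", "LF"): ("after_lf_lf", ""),
    ("after_lf", "BODY"): ("body", "\n"),  # assumption: keep the single newline
    ("after_lf_lf", "LF"): ("after_lf_lf", "\n"),
    ("after_lf_lf", "BODY"): ("body", "\n\n"),
    ("after_lf_lf", "HEADER"): ("after_header", ""),
}


def parse_sections(text):
    """Return a list of (header, body) pairs, or raise ValueError."""
    sections = []
    state = "start"
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at offset {pos}")
        kind = m.lastgroup
        if (state, kind) not in TRANSITIONS:
            raise ValueError(f"unexpected {kind} in state {state}")
        state, added = TRANSITIONS[(state, kind)]
        if kind == "HEADER":
            # Drop the leading "#" and start a new section with an empty body.
            sections.append([m.group()[1:], ""])
        else:
            sections[-1][1] += added
            if kind == "BODY":
                sections[-1][1] += m.group()
        pos = m.end()
    # EOF handling: "start" has no EOF transition; after_lf_lf adds "\n".
    if state == "start":
        raise ValueError("unexpected EOF in state start")
    if state == "after_lf_lf":
        sections[-1][1] += "\n"
    return [tuple(section) for section in sections]
```

As written, the machine only accepts a HEADER at the very start or after a blank line, so an input such as `"#data\nfoo\n#errors\n"` raises `ValueError` in this sketch.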

This shouldn't be much effort to convert into an LR(2) grammar; the LR(1) grammar equivalent may be hell.

Basically what we want, for the terminals above, is:

test = HEADER LF (BODY | LF)*
tests = test (LF LF test)* LF

Which is LR(2).
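
As a quick way to experiment with the two rules, the sketch below maps each terminal to one letter (H, L, B) and checks the resulting string against a regular expression with the same shape as `tests`. This is only a recogniser built on a backtracking regex, so it says nothing about the LR(k) class of the grammar; `token_kinds` and `matches_tests` are hypothetical names.

```python
import re

# One letter per terminal: H = HEADER, L = LF, B = BODY.
# LF is tried before BODY because "[^#]" would also match a newline.
TOKEN_RE = re.compile(r"(?P<H>#[^\n]+)|(?P<L>\n)|(?P<B>[^#][^\n]*)")

# test  = HEADER LF (BODY | LF)*   ->  HL[BL]*
# tests = test (LF LF test)* LF    ->  HL[BL]*(?:LLHL[BL]*)*L
TESTS_RE = re.compile(r"HL[BL]*(?:LLHL[BL]*)*L")


def token_kinds(text):
    """Tokenise text and return the token kinds as a string of H/L/B."""
    kinds = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at offset {pos}")
        kinds.append(m.lastgroup)
        pos = m.end()
    return "".join(kinds)


def matches_tests(text):
    """True if the token sequence of text has the shape of `tests`."""
    return TESTS_RE.fullmatch(token_kinds(text)) is not None
```

For example, `matches_tests("#data\nfoo\n\n#data\nbar\n")` holds, while a file without a trailing newline, such as `"#data\nfoo"`, is rejected.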