html5lib/html5lib-tests

DOCTYPE parse errors missing from various tree construction tests

nolanw opened this issue · 4 comments

Many tree construction tests should, but don't, include parse errors for absent or invalid <!DOCTYPE> elements.

As an example of each: every test in tree-construction/adoption01.dat should include a parse error for the missing <!DOCTYPE>, and the fourth test in tree-construction/doctype01.dat should include a parse error for the invalid <!DOCTYPE>.

If there's agreement that the tests should include these parse errors, I'll go ahead and open a pull request from this issue. In that case, is there some kind of format for the errors that I should adhere to? I'm unable to divine the format from examples.

It's fairly well known that the set of parse errors in most tests doesn't match what the spec says. Nobody has ever cared about them that much, and I believe most still don't (though @hsivonen and @abarth may say otherwise, given they now report them for dev tools?). Obviously it'd be nice to have them match the spec; bear in mind that when most tests were written, what was and wasn't a parse error still changed semi-regularly. So patches are more than welcome.

The format is basically one line (containing something!) per parse error. It's hard to require any specific format given that the spec places no requirements on ordering (e.g., html5lib/html5lib-python throws all the parse errors from the input stream (i.e., permanent noncharacters) when reading the chunk), so in practice they're just descriptive. Try to include a line/column number and some sort of message that vaguely describes the parse error!
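Since the messages are only descriptive, about the only thing a test harness can reasonably compare is the number of errors. Here's a rough Python sketch of counting the expected errors for one test, assuming the usual `#data` / `#errors` / `#document` section layout of the tree construction `.dat` files. The function name, the header list, and the sample error lines are made up for illustration, and the sketch would miscount a test whose input data happens to contain a line that looks like a section header.

```python
# Section headers assumed for tree construction tests; adjust to taste.
SECTION_HEADERS = {
    "#data", "#errors", "#new-errors", "#document-fragment",
    "#script-on", "#script-off", "#document",
}


def expected_error_count(test_text):
    """Count the non-empty lines in the #errors section of a single test."""
    count = 0
    in_errors = False
    for line in test_text.splitlines():
        header = line.strip()
        if header in SECTION_HEADERS:
            # Entering a new section; only #errors lines are counted.
            in_errors = header == "#errors"
            continue
        if in_errors and header:
            count += 1
    return count


# Illustrative test text; the positions and messages are not from a real test.
sample = """#data
<a><p></a></p>
#errors
(1,3): expected-doctype-but-got-start-tag
(1,10): adoption-agency-1.3
#document
| <html>
"""

print(expected_error_count(sample))
```

A parser under test would then only need to report the same number of errors, whatever wording it uses for them.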

(Unrelated: I presume you're working on some implementation of the parser? In what language? Code anywhere?)

That's actually a relief! I've been combing through my code looking for spurious parse errors, on the assumption that my code was wrong. Which it probably still is, but I'll take a more critical look at the tests now.

I am indeed writing a parser, in Objective-C. No code available yet, but I'll come back when that changes.

Heh, yeah. Don't fully trust anything but the trees given in the tree construction tests. Nothing else is kept as actively up to date with the spec.

When it comes to format, I'd say to go for (line,col): message where line is 1-based, col is 0-based, message is what html5lib-python gives. But there's currently no rule, so it doesn't matter much.
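For what it's worth, that suggested shape is easy to produce and to read back. A minimal Python sketch, assuming exactly the `(line,col): message` layout described above; `format_error` and `parse_error` are hypothetical helper names, not part of html5lib-python, and the message string is just an example.

```python
import re

# Matches "(line,col): message" with no space after the comma.
_ERROR_LINE = re.compile(r"^\((\d+),(\d+)\): (.+)$")


def format_error(line, col, message):
    """Render one parse error; line is 1-based, col is 0-based."""
    if line < 1:
        raise ValueError("line is 1-based")
    if col < 0:
        raise ValueError("col is 0-based")
    return f"({line},{col}): {message}"


def parse_error(text):
    """Return (line, col, message), or None for a free-form error line."""
    match = _ERROR_LINE.match(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


formatted = format_error(1, 0, "expected-doctype-but-got-eof")
print(formatted)
print(parse_error(formatted))
print(parse_error("Unexpected end of file"))
```

Returning `None` for lines that don't match keeps the helper usable on existing tests, where error lines follow no particular rule.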

I finally stuck my code online if you're curious. Thanks again for all these tests! Between them and a well-written spec, things went way faster than they otherwise would have.