html5lib/html5lib-tests

New and old parse errors

stevecheckoway opened this issue · 6 comments

I'm trying to test that an HTML parser produces the correct errors, but some of the tests list both #errors and #new-errors, where the new errors appear to be the current standard's names for old errors. For example:

#data
<?COM--MENT?>
#errors
(1,1): expected-tag-name-but-got-question-mark
(1,13): expected-doctype-but-got-eof
#new-errors
(1:2) unexpected-question-mark-instead-of-tag-name
#document
| <!-- ?COM--MENT? -->
| <html>
|   <head>
|   <body>

I'm not sure why the first error is there, but I assume an older version of the standard had something like a lookahead on the <. As I read the current standard, there should be an unexpected-question-mark-instead-of-tag-name error and a (currently unnamed in the standard) parse error for the missing DOCTYPE.

Is it always the case that a new error replaces exactly one of the old errors? If not, how should these be handled?
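For anyone else puzzling over the format: the test above can be split into its "#"-prefixed sections with a short sketch like the one below. This is my own simplified reading of the .dat layout, not the testsuite's actual runner (the real format has subtleties, e.g. #data content that itself starts with "#", which this ignores).

```python
def parse_test_case(text):
    """Split one .dat test case into its '#'-prefixed sections.

    Simplified sketch: treats every line starting with '#' as a
    section header, which is not robust for all real test data.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("#"):
            current = line[1:]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


case = """#data
<?COM--MENT?>
#errors
(1,1): expected-tag-name-but-got-question-mark
(1,13): expected-doctype-but-got-eof
#new-errors
(1:2) unexpected-question-mark-instead-of-tag-name
#document
| <!-- ?COM--MENT? -->
| <html>
|   <head>
|   <body>"""

parsed = parse_test_case(case)
# parsed["errors"] has two entries, parsed["new-errors"] has one,
# which is exactly the mismatch being asked about.
```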

Looking at #92 again, which introduced these:

Note that for now we've decided to move new error codes to a separate section in tree construction stage tests to not mix things up. Once we have a spec for tree construction stage errors, we'll remove old errors and move new errors to #errors section.

So I think in principle #new-errors should be complete for tokenizer errors, and the number of lines in #errors should be the total number of parse errors (tokenizer + tree construction)?
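If that convention holds, a consequence you can check mechanically is that #new-errors can never be longer than #errors, and the difference gives the implied number of tree-construction errors. A hedged sketch (my own helper, not part of the testsuite):

```python
def check_error_counts(old_errors, new_errors):
    """Under the convention described above:
    len(old_errors) = tokenizer errors + tree-construction errors,
    len(new_errors) = tokenizer errors only.

    Returns (convention_holds, implied tree-construction error count).
    """
    tree_construction = len(old_errors) - len(new_errors)
    return tree_construction >= 0, tree_construction


# Using the example test case from this issue:
ok, tree_errors = check_error_counts(
    ["(1,1): expected-tag-name-but-got-question-mark",
     "(1,13): expected-doctype-but-got-eof"],
    ["(1:2) unexpected-question-mark-instead-of-tag-name"],
)
# ok is True; tree_errors is 1 (the missing-DOCTYPE error).
```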

@gsnedders great, thanks! That eliminates about 250 test failures for me, 120 to go!

@gsnedders Is #113 the right approach here? Two of the errors aren't errors (any longer?), and it looks like one is an error now but wasn't before.

It's also possible that the error logic didn't actually flip as that PR would indicate and instead the test was just wrong before.

@stevecheckoway FWIW, the error data is by far the part of the testsuite most likely to be wrong, because very few implementations use it, so it's entirely plausible that the test was simply wrong.

@gsnedders Makes sense. Without standardized error names (at least until now), it's pretty tricky to test that the errors are correct. The number of errors is a pretty weak proxy, so I can see why people wouldn't bother.

For what it's worth, those eight PRs resolve about half of the 120 test failures due to mismatched error counts that I'm getting when running Nokogumbo against the #script-off tests. (That isn't to say the remaining failures are entirely in the testsuite; I've caught bugs in my own code, which, of course, is the whole point of my testing.)

Any progress on updating the errors in the tests to reflect the spec?

The tree-construction stage still doesn't have its own error codes, unfortunately, but it'd be nice for the tests to at least have the correct number of errors.