mangalam-research/salve

HTML5 Validation


Hi @lddubeau,

Does salve allow validating HTML files against the official W3C HTML5 .rng files?

If I read it correctly, salve does not support parsing the HTML file itself. Is there a recommended parser that works with salve out of the box?

It should be fine, with some caveats:

  1. salve-convert is used to convert the schema to something salve understands internally. The input must be a Relax NG schema in XML (.rng) rather than in the compact notation (.rnc). I see that all the files there are in the compact notation. However, since there's a one-to-one equivalence between Relax NG in XML and the compact notation, it should just be a matter of using a tool like trang to convert the .rnc files to .rng files.

  2. As the comment at the top of html5.rnc states, the HTML needs to be converted to XML first. This must also be taken care of when trying to validate with salve. It may be possible to avoid an actual conversion of the HTML to XML. Instead, whatever parses the HTML could emit events as if the HTML were XML, which would make the HTML look like XML as far as salve is concerned. For instance, upon encountering the HTML <input ...> it could emit enterStartTag, the events for the attributes, leaveStartTag, and then immediately emit endTag to close the element. This would replicate the sequence of events that would be emitted for the XML equivalent <input .../>.
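To make the second point concrete, here is a minimal sketch of how a parser callback might turn an HTML opening tag into XML-like events. The function name eventsForOpenTag, the array representation of events, and the use of the XHTML namespace are assumptions made for illustration; only the event names come from salve. The exact way events are passed to a walker depends on the salve version, so this only builds the list.

```javascript
// HTML void elements never have a closing tag in HTML source, so the
// closing event has to be synthesized for them.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Assumption: the schema expects elements in the XHTML namespace.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: returns the events to emit when the HTML parser
// reports an opening tag. Each event is [eventName, ...parameters].
function eventsForOpenTag(name, attributes) {
  const events = [["enterStartTag", XHTML_NS, name]];
  for (const [attrName, attrValue] of Object.entries(attributes)) {
    // Unprefixed attributes are in no namespace.
    events.push(["attributeName", "", attrName]);
    events.push(["attributeValue", attrValue]);
  }
  events.push(["leaveStartTag"]);
  if (VOID_ELEMENTS.has(name)) {
    // Close the element right away, as if the source were <input .../>.
    events.push(["endTag", XHTML_NS, name]);
  }
  return events;
}
```

For non-void elements, the matching endTag event would be emitted from the parser's close-tag callback instead.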


As far as recommended parsers go, I've used sax for the test suite. The main examples of its use are:

If you happen to have HTML/XML in a DOM tree you can also walk the tree and emit appropriate events. That's how it is done for wed.
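As a rough illustration of the tree-walking approach, the sketch below walks a tree and yields events in document order. It is not the code used in wed. To stay self-contained it uses plain objects of the form { name, ns, attributes, children } with strings as text nodes, which is an assumed stand-in for real DOM nodes; eventsForNode is a hypothetical name.

```javascript
// Hypothetical generator: yields [eventName, ...parameters] for a node
// and all of its descendants, in document order.
function* eventsForNode(node) {
  if (typeof node === "string") {
    // Text node: skip empty strings, emit the rest as text events.
    if (node !== "") {
      yield ["text", node];
    }
    return;
  }

  const ns = node.ns || "";
  yield ["enterStartTag", ns, node.name];
  for (const [attrName, attrValue] of Object.entries(node.attributes || {})) {
    yield ["attributeName", "", attrName];
    yield ["attributeValue", attrValue];
  }
  yield ["leaveStartTag"];

  // Recurse into children before closing the element.
  for (const child of node.children || []) {
    yield* eventsForNode(child);
  }
  yield ["endTag", ns, node.name];
}

// Usage: collect the events for a small tree.
const sampleTree = {
  name: "p",
  attributes: { class: "x" },
  children: ["hi", { name: "br" }],
};
const sampleEvents = [...eventsForNode(sampleTree)];
```

Because the tree already records where each element ends, no special handling of void elements is needed here: every element gets its endTag after its children.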

Wow, cool, thanks for your quick response and all the insights 👍

I managed to convert the .rnc file to .rng using rng2srng:

java -jar rng2srng.jar -c validator/schema/html5/html5.rnc > html5.rng

As a parser, I would like to use https://github.com/fb55/htmlparser2, as it is already used by https://github.com/htmllint/htmllint, but I have to find out whether it can treat <input> as <input/> out of the box.

I'm going to close this, but questions may still be posted if necessary.

Two things that have changed since my initial answer:

  1. For people who want to validate XML documents represented as DOM trees, there's salve-dom.

  2. Salve 6.0.0 no longer requires that schemas be processed ahead of time with the command-line tool salve-convert. You can load salve and call convertRNGToPattern to convert the Relax NG schema in XML form (salve still does not read the compact form directly) into salve's internal form, then use the result to get a walker on which to fire events.