/slowparse

A slow JS-based HTML parser with good error feedback and debugging metadata.

Primary LanguageJavaScriptMozilla Public License 2.0MPL-2.0

Slowparse, a friendlier HTML5 parser

Slowparse is an experimental JavaScript-based HTML5 parser born out of Mozilla Webmaking initiatives. A live demo of Slowparse can be found over at http://mozilla.github.io/slowparse

Installing Slowparse

The Slowparse library can be used both in the browser and in environments that support commonjs requirements such as Node.js, by respectively including it as a script resource:

<script src="slowparse.js"></script>

or as module import, by installing it using npm:

$> npm install slowparse

After installing, Slowparse can then be required into your code like any other module:

var Slowparse = require("slowparse");

Using Slowparse

To use Slowparse, call its .HTML function:

var result = Slowparse.HTML(document, '... html source here ...', options);

This function takes a DOM context as first argument, and HTML5 source code as second argument. The options object is optional, and if used can contain:

options.errorDetectors

This is an array of "additional parsers" that will be called as 'detector(html, domBuilder.fragment)` when no errors are found by Slowparse. These can be useful when you have additional constraints on what HTML source is permitted in your own software that cannot or should not be dealt with by Slowparse itself.

This is mostly a convenience construction, and using it is equivalent to doing an if (!result.error) test and running the input through your own, additional parsers if no errors we found.

options.disallowActiveAttributes

This option can be either true or false, and when true will blank out attributes when it sees any that start with on such as onclick, onload, etc.

This means the DOM formed during the Slowparse run is a tiny bit more secure, although you will still be responsible for checking for potentially harmful active content (Slowparse is not a security tool, and should not be used as such).

Validating HTML

Slowparse accepts both full HTML5 documents (starting at <!doctype html> and ending in </html>) as well as well formatted HTML5 fragments. Any input that does not pass HTML5 validation leads to a result output with an error property:

var result = Slowparse.HTML(document, '<a href+></a>');
console.log(result.error);
/*
  {
    type: 'INVALID_ATTR_NAME',
    start: 3,
    end: 8,
    attribute: { name: { value: "+" }},
    cursor: 3
  };
*/

There are a large number of errors that Slowparse can generate in order to indicate not just that a validation error occurred, but also what kind of error it was. The full list of reportable errors can currently be found in the ParseErrorBuilders.js file.

Using validated HTML

If Slowparse yields a result without an .error property, the input HTML is considered valid HTML5 code, and can be injected into whatever context you need it injected into.

var input = "...";

var result = Slowparse.HTML(document, input);

if (!result.error) {
  activeContext.inject(input);
} else {
  notifyUserOfError(result.error);
}

Note that Slowparse generates an internal DOM for validation that can be tapped into, as result.document. If no options object with the disallowActiveAttributes is passed during parsing, this DOM should be identical to the one built by simply injecting your source code. If disallowActiveAttributes:true is used, this DOM will be the same as the one built by the browser, with the exception of on... attributes, which will have been forced empty to prevent certain immediate script actions from kicking in.

Getting friendlier error messages

By default, Slowparse generates error objects. However, if you prefer human-readable error messages, the ./locale/ directory contains a file en_US.json that consists of English (US) localized error snippets. These are bits of HTML5 with templating variables that can be instantiated with the corresponding error object.

For example, if you are getting a MISSING_CSS_BLOCK_CLOSER error, the local file specifies the following human-friendly error:

<p>Missing block closer or next property:value; pair following
<em data-highlight='[[cssValue.start]],[[cssValue.end]]'>[[cssValue.value]]</em>.</p>

We can replace [[cssValue.start]] with Slowparse's result.error.cssValue.start and [[cssValue.end]] with result.error.cssValue.end, and the same for cssValue.value, to generate a functional error. For instance, if there is an error in a CSS block after a property background:white, with "white" on the 24th character in the stream, the error might resolve as:

<p>Missing block closer or next property:value; pair following
<em data-highlight='24,29'>white</em>.</p>

Note that Slowparse has no built in mechanism for generating these errors, but only provides you with the error objects as a result from parsing, and the locale file for resolving error objects to uninstantiated human readable HTML snippets.

Working on Slowparse

The slowparse code is split up into modules, located in the ./src directory, which are aggregated by ./src/index.js for constructing the slowparse library. This construction is handled by browserify, and runs every time the npm test command is run, yielding a rebuilt slowparse.js.

If you wish to help out on Slowparse, we try to keep Slowparse test-driven, so if you have bad code that is being parsed incorrectly, create a new test case in the ./test/test-slowparse.js file. To see how tests work, simply open that file and have a look at the various tests already in place. Generally all you need to do is copy-paste a test case that's similar to what you're testing, and changing the description, input HTML, and test summary for pass/fail results.

Passing all tests is the basic prerequisite to a patch for Slowparse landing, so make sure your code comes with tests and all of them pass =)