The Märkəd Project

A Rust language project for parsing, filtering, selecting and serializing HTML and XML mark-up.

See the marked and marked-cli crates, or the README(s) and CHANGELOG(s) under this (GitHub-hosted) source tree and cargo workspace.

Feature Overview

Currently implemented features:

A vector-allocated, indexed, DOM-like tree structure

The marked::Document is a DOM-like tree structure suitable for HTML and XML. It was forked from the victor project (same author as html5ever) and further optimized. It is implemented as a (std) Vec of Node values, each of which references its parent, siblings and children via (std) NonZeroU32 indexes for space efficiency.
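The layout described above can be illustrated with a minimal, std-only sketch. This is not the crate's actual implementation: the names Tree, Node, NodeId and append_child are hypothetical, and reserving slot 0 as padding is an assumption made here so that every real index is non-zero.

```rust
use std::num::NonZeroU32;

/// Index of a node in the arena. Zero is never a valid index, so
/// `Option<NodeId>` occupies the same 4 bytes as a plain `u32`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(NonZeroU32);

impl NodeId {
    fn index(self) -> usize {
        self.0.get() as usize
    }
}

/// A node that links to its relatives by index rather than by pointer.
#[derive(Debug)]
pub struct Node {
    pub name: String,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

impl Node {
    fn new(name: &str) -> Self {
        Node {
            name: name.to_string(),
            parent: None,
            first_child: None,
            last_child: None,
            next_sibling: None,
        }
    }
}

/// All nodes live in one contiguous vector (hypothetical type).
pub struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Slot 0 is unused padding (an assumption of this sketch); slot 1 is the root.
    pub fn new() -> Self {
        Tree { nodes: vec![Node::new("#padding"), Node::new("#root")] }
    }

    pub fn root(&self) -> NodeId {
        NodeId(NonZeroU32::new(1).expect("1 is non-zero"))
    }

    /// Pushes a new node and links it as the last child of `parent`.
    pub fn append_child(&mut self, parent: NodeId, name: &str) -> NodeId {
        assert!(self.nodes.len() < u32::MAX as usize, "too many nodes");
        let id = NodeId(
            NonZeroU32::new(self.nodes.len() as u32).expect("slot 0 is padding"),
        );
        let mut node = Node::new(name);
        node.parent = Some(parent);
        self.nodes.push(node);

        let previous_last = self.nodes[parent.index()].last_child;
        match previous_last {
            Some(last) => self.nodes[last.index()].next_sibling = Some(id),
            None => self.nodes[parent.index()].first_child = Some(id),
        }
        self.nodes[parent.index()].last_child = Some(id);
        id
    }

    /// Follows the sibling links to list the children of `parent` in order.
    pub fn children(&self, parent: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut next = self.nodes[parent.index()].first_child;
        while let Some(id) = next {
            out.push(id);
            next = self.nodes[id.index()].next_sibling;
        }
        out
    }

    pub fn name(&self, id: NodeId) -> &str {
        &self.nodes[id.index()].name
    }
}
```

Because the compiler uses zero as the None value, each optional link costs four bytes, and the whole tree sits in a single allocation.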

html5ever integration

This includes HTML5 document and fragment parsing, and HTML5 serialization (mark-up output). With the marked::Document (DOM), parsing and serialization are measurably faster (see benchmarks in the source tree) than with the RcDom previously included with the html5ever associated crates, and mutating the Document is more straightforward, via a mutable reference.

xml-rs integration

Strict, UTF-8 XML parsing to marked::Document is currently supported by integration of the xml-rs crate.

Legacy character encoding support

An estimated 5% of the web remains in encodings other than UTF-8, which is too common to be treated as an error. marked::html::parse_buffered supports these via:

  • Decoding via encoding_rs which implements The Encoding Standard including alternative names (labels) for supported encodings.

  • Restarting HTML5 parsing from the initial (4k) buffer with new encoding hints obtained from a <head>/<meta> charset or an http-equiv content-type with charset.

  • Byte-Order-Mark (BOM) sniffing as a high priority EncodingHint for UTF-8, UTF-16 Big-Endian and UTF-16 Little-Endian.

  • "Impossible" hints from the above are ignored. For example, a hint read while decoding as UTF-8 that claims the document is UTF-16LE: had UTF-16LE actually been in use, that same hint could not have been read.

(Note that the detection features are not currently provided by html5ever and associated crates.)
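To make the BOM and "impossible" hint rules above concrete, here is a small std-only sketch. It does not use the crate's API or encoding_rs: the Bom enum, sniff_bom and is_impossible_hint are hypothetical names, and the impossibility check is a simplified assumption that covers only the UTF-16 case given as the example.

```rust
/// Encodings that can be identified from a leading Byte-Order-Mark.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Bom {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Returns the encoding indicated by a BOM at the start of `buf`, if any.
pub fn sniff_bom(buf: &[u8]) -> Option<Bom> {
    if buf.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(Bom::Utf8)
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Some(Bom::Utf16Be)
    } else if buf.starts_with(&[0xFF, 0xFE]) {
        Some(Bom::Utf16Le)
    } else {
        None
    }
}

/// True for the UTF-16 family of labels (a deliberately short list).
fn is_utf16(label: &str) -> bool {
    let label = label.trim().to_ascii_lowercase();
    matches!(label.as_str(), "utf-16" | "utf-16le" | "utf-16be")
}

/// A charset hint found in mark-up is "impossible" when the bytes were
/// successfully read as an ASCII-compatible encoding, yet the hint claims
/// UTF-16: real UTF-16 bytes would not have decoded to a readable <meta> tag.
pub fn is_impossible_hint(decoded_as: &str, hinted: &str) -> bool {
    !is_utf16(decoded_as) && is_utf16(hinted)
}
```

A caller would presumably check the BOM first, since it has the highest priority, and only then weigh hints found in the mark-up, discarding any that is_impossible_hint rejects.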

Rust "selectors" API

A NodeRef type with "CSS selectors"-like methods to recursively select and find elements using closure predicates. We prefer direct Rust language compiler support for writing such selection logic over CSS or another interpreted DSL.
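The idea of selecting with closure predicates can be sketched without the crate. The Element type and the select_all and find_first methods below are hypothetical stand-ins for illustration, not the NodeRef API. The point is that the predicate is ordinary, compiler-checked Rust.

```rust
/// A toy element tree (hypothetical; not the crate's NodeRef).
#[derive(Debug)]
pub struct Element {
    pub tag: String,
    pub attrs: Vec<(String, String)>,
    pub children: Vec<Element>,
}

impl Element {
    pub fn new(tag: &str) -> Self {
        Element { tag: tag.to_string(), attrs: Vec::new(), children: Vec::new() }
    }

    /// Builder-style helper: adds an attribute.
    pub fn attr(mut self, name: &str, value: &str) -> Self {
        self.attrs.push((name.to_string(), value.to_string()));
        self
    }

    /// Builder-style helper: appends a child element.
    pub fn child(mut self, child: Element) -> Self {
        self.children.push(child);
        self
    }

    /// Looks up an attribute value by name.
    pub fn get(&self, name: &str) -> Option<&str> {
        self.attrs.iter().find(|(n, _)| n == name).map(|(_, v)| v.as_str())
    }

    /// Collects every descendant for which `predicate` returns true,
    /// in document order.
    pub fn select_all<P>(&self, predicate: P) -> Vec<&Element>
    where
        P: Fn(&Element) -> bool,
    {
        let mut out = Vec::new();
        self.collect_into(&predicate, &mut out);
        out
    }

    /// Returns the first matching descendant in document order, if any.
    pub fn find_first<P>(&self, predicate: P) -> Option<&Element>
    where
        P: Fn(&Element) -> bool,
    {
        self.select_all(predicate).into_iter().next()
    }

    fn collect_into<'a>(
        &'a self,
        predicate: &dyn Fn(&Element) -> bool,
        out: &mut Vec<&'a Element>,
    ) {
        for child in &self.children {
            if predicate(child) {
                out.push(child);
            }
            child.collect_into(predicate, out);
        }
    }
}
```

For instance, doc.select_all(|e| e.tag == "li" && e.get("class") == Some("lead")) plays roughly the role of the CSS selector li.lead, with typos in field or method names caught at compile time.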

HTML tag and attribute metadata

See marked::html::t (tags) and marked::html::a (attributes) modules.

Tree walking filters API

Bulk modifications to the DOM are easily and efficiently achieved with mutating filter functions/closures and a tree walker (depth or breadth-first) implementation in marked. This style of interface is sometimes called the "visitor pattern". See Document::filter_at for details. The crate also includes the following built-in filters (a partial list):

detach_banned_element : Detach known banned (via metadata) and unknown elements

retain_basic_attributes : Remove all attributes that are not part of the "basic" logical set (via metadata)

fold_empty_inline : Fold empty or meaningless inline elements

text_normalize : Normalize text nodes by merging, replacing control characters and minimizing white-space.
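The filter style described above can be sketched with a toy tree and a depth-first walker. This is an illustration under stated assumptions rather than the crate's implementation: the Node and Action types, filter_tree and example_filter are hypothetical, and the white-space handling is far simpler than a real text normalizer.

```rust
/// What the walker should do with a node after the filter has seen it
/// (hypothetical type for this sketch).
pub enum Action {
    /// Keep the node as is.
    Continue,
    /// Remove the node and everything below it.
    Detach,
    /// Remove the node but keep its children in its place.
    Fold,
}

/// A toy DOM node.
#[derive(Debug, PartialEq)]
pub enum Node {
    Text(String),
    Elem { tag: String, children: Vec<Node> },
}

/// Walks `children` depth-first (descendants before their parent) and
/// applies the mutating filter `f` to each node.
pub fn filter_tree<F>(children: &mut Vec<Node>, f: &mut F)
where
    F: FnMut(&mut Node) -> Action,
{
    let mut kept = Vec::with_capacity(children.len());
    for mut node in children.drain(..) {
        if let Node::Elem { children: inner, .. } = &mut node {
            filter_tree(inner, f);
        }
        match f(&mut node) {
            Action::Continue => kept.push(node),
            Action::Detach => {}
            Action::Fold => match node {
                Node::Elem { children: inner, .. } => kept.extend(inner),
                text => kept.push(text), // nothing to fold in a text node
            },
        }
    }
    *children = kept;
}

/// Example filter: detach <script>, fold <font>, collapse white-space in text.
pub fn example_filter(node: &mut Node) -> Action {
    match node {
        Node::Elem { tag, .. } if tag.as_str() == "script" => Action::Detach,
        Node::Elem { tag, .. } if tag.as_str() == "font" => Action::Fold,
        Node::Text(text) => {
            *text = text.split_whitespace().collect::<Vec<_>>().join(" ");
            Action::Continue
        }
        _ => Action::Continue,
    }
}
```

Calling filter_tree(&mut nodes, &mut example_filter) edits the tree in a single pass. Because the filter is just a function or closure, several filters can be combined by calling them in sequence inside one closure.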

An unreleased example, compatibility test and benchmark of filtering equivalent to the ammonia crate (for hygiene and safety) is included in the source tree (./ammonia-compare).

Roadmap

Incomplete or unstarted features that may be included in this project in the future (PRs welcome):

  • Complete (faster, more correct, legacy encodings) strict-mode XML parsing

  • Lenient-mode XML parsing

  • Optional (opt-in) direct charset detection (initial read buffer or entire document) via something like chardet, integrated as high priority EncodingHint.

  • XML/HTML pretty-indenting serialization (combines well with the existing white-space normalization features)

  • XML (and XHTML) serialization

License

This project is dual licensed under either of the following:

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the märkəd project by you, as defined by the Apache License, shall be dual licensed as above, without any additional terms or conditions.