Namespaces
wooorm opened this issue · 9 comments
TL;DR
I’m thinking out loud. We need namespace information. I can think of three solutions. Not sure which is best.
Introduction
HTML has the concept of elements: things like <strong></strong> are normal elements. There’s a subcategory of “foreign elements”: those from MathML (mi) or from SVG (rect).
A practical example of why this information is needed is tag-name normalisation: in HTML, tag names are case-insensitive; in SVG and MathML, they are not. Unfortunately, tag names alone cannot be used to detect whether an element is foreign, because some elements exist in multiple namespaces. For example: var in HTML and MathML, and a in HTML and SVG.
Take the following code:
<!doctype html>
<title>Foreign elements in HTML</title>
<h1>HTML</h1>
<a href="#">HTML link</a>
<var>htmlVar</var>
<svg>
<a href="#">SVG link</a>
<span>SVG</span>
<a href="#">SVG link</a>
</svg>
<math>
<mi>mathMLVar</mi>
<span>MathML</span>
<mi>mathMLVar</mi>
</math>
Running the following script:
var length = document.all.length;
var index = -1;
var node;
while (++index < length) {
  node = document.all[index];
  console.log([node.tagName, node.namespaceURI, node.textContent]);
}
Yields:
[Log] ["HTML", "http://www.w3.org/1999/xhtml", "Foreign elements in HTML↵HTML↵HTML link↵htmlVar↵↵ …G↵ SVG link↵↵↵ mathMLVar↵ MathML↵ mathMLVar↵↵"] (3)
[Log] ["HEAD", "http://www.w3.org/1999/xhtml", "Foreign elements in HTML↵"] (3)
[Log] ["TITLE", "http://www.w3.org/1999/xhtml", "Foreign elements in HTML"] (3)
[Log] ["BODY", "http://www.w3.org/1999/xhtml", "HTML↵HTML link↵htmlVar↵↵ SVG link↵ SVG↵ SVG link↵↵↵ mathMLVar↵ MathML↵ mathMLVar↵↵"] (3)
[Log] ["H1", "http://www.w3.org/1999/xhtml", "HTML"] (3)
[Log] ["A", "http://www.w3.org/1999/xhtml", "HTML link"] (3)
[Log] ["VAR", "http://www.w3.org/1999/xhtml", "htmlVar"] (3)
[Log] ["svg", "http://www.w3.org/2000/svg", "↵ SVG link↵ "] (3)
[Log] ["a", "http://www.w3.org/2000/svg", "SVG link"] (3)
[Log] ["SPAN", "http://www.w3.org/1999/xhtml", "SVG"] (3)
[Log] ["A", "http://www.w3.org/1999/xhtml", "SVG link"] (3)
[Log] ["math", "http://www.w3.org/1998/Math/MathML", "↵ mathMLVar↵ "] (3)
[Log] ["mi", "http://www.w3.org/1998/Math/MathML", "mathMLVar"] (3)
[Log] ["SPAN", "http://www.w3.org/1999/xhtml", "MathML"] (3)
[Log] ["MI", "http://www.w3.org/1999/xhtml", "mathMLVar"] (3)
Note 1: Non-foreign elements break out of their foreign context.
Note 2: HTML tag names are case-insensitive (normalised to upper-case); foreign tag names are case-sensitive.
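To make the second note concrete, here is a minimal sketch of namespace-aware normalisation. The helper name `normaliseTagName` is hypothetical (it is not part of any existing package); it only mirrors the casing seen in the log above, where names in the HTML namespace are upper-cased and foreign names are left alone.

```javascript
// Namespace URIs as reported by `namespaceURI` in the log above.
const HTML_NAMESPACE = 'http://www.w3.org/1999/xhtml'
const SVG_NAMESPACE = 'http://www.w3.org/2000/svg'

// Hypothetical helper: upper-case names in the HTML namespace and
// return names from any other namespace unchanged.
function normaliseTagName(name, namespace) {
  return namespace === HTML_NAMESPACE ? name.toUpperCase() : name
}

console.log(normaliseTagName('a', HTML_NAMESPACE)) // 'A'
console.log(normaliseTagName('a', SVG_NAMESPACE)) // 'a'
console.log(normaliseTagName('linearGradient', SVG_NAMESPACE)) // 'linearGradient'
```

The same name (`a`) gives a different result depending on the namespace, which is why the name alone is not enough.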
Proposal
I propose either of the following:
- Add namespace on some nodes (notably, root, <math>, <svg>). To determine the namespace of a node, check its closest ancestor with a namespace.
- Add namespace on root nodes (and wrap <svg> and <math> in roots). To determine the namespace of a node, check its closest root for a namespace. This changes the semantics of roots somewhat.
- Add namespace on any element.
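A rough sketch of how the first option could be resolved on a tree without parent pointers. Everything here is an assumption for illustration: the `namespace` field, its short values (`'html'`, `'svg'`), the `'html'` default, and the helper name `namespaceOf` are all hypothetical.

```javascript
// Hypothetical tree: only the root and `svg` carry a `namespace` field.
const exampleTree = {
  type: 'root',
  namespace: 'html',
  children: [
    {type: 'element', tagName: 'a', children: []},
    {
      type: 'element',
      tagName: 'svg',
      namespace: 'svg',
      children: [{type: 'element', tagName: 'a', children: []}]
    }
  ]
}

// Walk down from `tree`, carrying the most recent `namespace` seen, and
// return it once `target` is reached. Returns `undefined` when `target`
// is not in the tree. Assumes `'html'` when nothing is set.
function namespaceOf(tree, target, inherited = 'html') {
  const current = tree.namespace || inherited
  if (tree === target) return current
  for (const child of tree.children || []) {
    const found = namespaceOf(child, target, current)
    if (found !== undefined) return found
  }
  return undefined
}

const htmlLink = exampleTree.children[0]
const svgLink = exampleTree.children[1].children[0]
console.log(namespaceOf(exampleTree, htmlLink)) // 'html'
console.log(namespaceOf(exampleTree, svgLink)) // 'svg'
```

Because there are no ancestral getters, the lookup has to start from the top of the tree, which is the downside of this option.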
The downside of the first two is that it’s hard to determine the namespace of an element in a syntax tree without ancestral getters. However, both make moving nodes around quite easy.
The latter is verbose, but does allow for easy access. However, it makes it easy for things to go wrong when shuffling nodes around.
Note: detecting namespaces upon creation (in rehype-parse) is very doable. I’d like to make the usage of hastscript and transformers very easy too, though!
Do let me know about your thoughts on this!
I have no strong opinion on this, but I lean towards one of the first two options. The third option seems unnecessarily cluttered and error-prone (easier to be messed up by plugins).
Since namespaces are naturally nested, it seems logical to have a rule that a namespace is determined by the closest namespace property (on whatever node it is), and it doesn't seem beneficial to require the namespace property to be on a node of a fixed type (but that's fine by me, too).
👍
I’m leaning towards the first. It’ll be easy to add (bookkeeping is already done when parsing, and one walk down could opt to add only the necessary namespaces), and it’ll be easy to handle for plug-in authors.
Is there anything else that namespaces would be used for between parsing and compilation, except to determine the proper tag/attribute casing of an element?
I can think of lots of parse differences, but those are already handled internally (I’m about to switch to a much better parser, parse5). Then, there are compilation differences, but those can of course be handled there pretty OK (as it’s in rehype-stringify).
Major use case for user-land would be to not walk into SVG / MathML by accident, I think. Hmm. That can be checked easily by determining whether an element is svg / math, though...
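A minimal sketch of that check, assuming hast-like nodes with `type`, `tagName`, and `children`. The helper names `isForeignRoot` and `visitOutsideForeign` are hypothetical.

```javascript
// Hypothetical helper: true for the elements that start foreign content.
function isForeignRoot(node) {
  return (
    node.type === 'element' &&
    (node.tagName === 'svg' || node.tagName === 'math')
  )
}

// Visit every element, but skip `svg` and `math` and everything in them.
function visitOutsideForeign(node, visitor) {
  if (isForeignRoot(node)) return
  if (node.type === 'element') visitor(node)
  for (const child of node.children || []) {
    visitOutsideForeign(child, visitor)
  }
}

const page = {
  type: 'root',
  children: [
    {type: 'element', tagName: 'a', children: []},
    {
      type: 'element',
      tagName: 'svg',
      children: [{type: 'element', tagName: 'a', children: []}]
    }
  ]
}

const seen = []
visitOutsideForeign(page, (node) => seen.push(node.tagName))
console.log(seen) // [ 'a' ]: the `a` inside `svg` is never visited
```

Note that this also skips any HTML nested inside the foreign subtree.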
> Major use case for user-land would be to not walk into SVG / MathML by accident
Have you thought about making it explicit then? Maybe some other property like { subtree: "..." } instead of children? So that in HTML-land the whole SVG is just a single black-box element, which can still be manipulated if needed.
> That can be checked easily by determining whether an element is svg / math, though...
How would it be different from namespaces?
> How would it be different from namespaces?
Not really different, just one less property.
> Have you thought about making it explicit then? Maybe some other property like { subtree: "..." } instead of children? So that in HTML-land the whole SVG is just a single black-box element, which can still be manipulated if needed.
That’s also possible, and it would make HAST more like programming language syntax trees.
I prefer the earlier idea of sub-roots for black boxes though. Still using children, but with notable semantics. And possible in the current, quite minimal, Unist interface.
There’s also an edge case where HTML is in SVG/MathML, which itself is in HTML. If either namespaces or subroots were used, it would be possible to walk into trees, remembering the current namespace, and transforming just the HTML namespaced elements.
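That edge case could look roughly like this. It is a sketch only, assuming a hypothetical `namespace` field that is set wherever the namespace changes, with `'html'` assumed when nothing is set. The helper name `visitHtmlElements` is made up for illustration.

```javascript
// Walk the tree, remembering the current namespace, and call `visitor`
// only for elements that end up in the (hypothetical) 'html' namespace.
function visitHtmlElements(node, visitor, inherited = 'html') {
  const namespace = node.namespace || inherited
  if (node.type === 'element' && namespace === 'html') visitor(node)
  for (const child of node.children || []) {
    visitHtmlElements(child, visitor, namespace)
  }
}

// HTML > SVG > HTML: a `div` inside `foreignObject` switches back to HTML.
const nestedTree = {
  type: 'root',
  namespace: 'html',
  children: [
    {type: 'element', tagName: 'p', children: []},
    {
      type: 'element',
      tagName: 'svg',
      namespace: 'svg',
      children: [
        {
          type: 'element',
          tagName: 'foreignObject',
          children: [
            {type: 'element', tagName: 'div', namespace: 'html', children: []}
          ]
        },
        {type: 'element', tagName: 'a', children: []}
      ]
    }
  ]
}

const htmlNames = []
visitHtmlElements(nestedTree, (node) => htmlNames.push(node.tagName))
console.log(htmlNames) // [ 'p', 'div' ]
```

The SVG `a` and `foreignObject` are passed over, while the `div` nested inside them is still reached.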
Not doing it for now. Maybe a utility which, when given a node, checks if it’s a foreign element, would do the trick.
Linking to rehypejs/rehype#2 (comment).
I thought about it a bit and I’d like to work on this now. For starters, this issue is now first tracked in wooorm/property-information#6. When that is done, we can work on updating it throughout the ecosystem.
I think we may be able to do without namespaces. But maybe we need to have, just like template, a content property for foreign content instead.
I’ll close this now. Again, if anyone has any further comments, please post them there!