RAWTEXT/RCDATA elements are closed inconsistently
12Me21 opened this issue · 6 comments
Initial checklist
- I read the support docs
- I read the contributing guide
- I agree to follow the code of conduct
- I searched issues and couldn’t find anything (or linked relevant results below)
Affected packages and versions
7.2.3
Link to runnable example
No response
Steps to reproduce
import {raw} from 'hast-util-raw'
import {u} from 'unist-builder'
let test1 = u('root', [
u('raw', '<xmp><b>bold</b>after</xmp>'),
])
let test2 = u('root', [
u('raw', '<xmp>'),
u('raw', '<b>'),
u('text', 'bold'),
u('raw', '</b>'),
u('text', 'after'),
u('raw', '</xmp>'),
])
console.log(raw(test1))
console.log(raw(test2))Expected behavior
The output tree should contain:
- root
- element:
xmp- text: "<b>bold</b>after"
- element:
(note: <xmp> is parsed in RAWTEXT mode, so tags/escapes aren't parsed inside it, and it only ends when the parser reaches </xmp>)
Actual behavior
- root
- element:
xmp- text: "<b>bold"
- text: "after"
- element:
the <xmp> element is being closed by the </b> tag
This affects all elements whose contents are parsed as RAWTEXT/SCRIPT_DATA (<style>, <iframe>, <noembed>, <noframes>, <script>) or RCDATA (<title>, <textarea>)
This can be seen in remark/rehype (with allowDangerousHtml enabled), in cases like:
<xmp><b>one</b><b>two</b></xmp>
which renders as:
<b>onetwo
Affected runtime and version
node@18.0.0
Affected package manager and version
npm@8.6.0
Affected OS and version
Debian
Build and bundle tools
Rollup
So, ehh, why are you using obsolete elements anyway? It’s marked as “Non-conforming features” by the HTML spec: so humans shouldn’t use it. Of course, you’re right that it would be good to fix.
A more logical reduced test case can probably be made with style, iframe, which also use rawtext (note that xmp does not relate to rcdata as far as I am aware, that’s for textarea, title).
welll, <xmp> is still pretty convenient despite being obsolete, since it's the only "normal" element which is parsed in rawtext mode
anyway, I mentioned rcdata since I believe this bug applies to those elements as well
it seems like this happens because the text node handler calls resetTokenizer(), which always resets the tokenizer into the DATA state, even when inside an element that should be parsed in some other state.
I suppose it would be possible to fix by just not resetting the state? (or only resetting it if it's not DATA/RAWTEXT/RCDATA/SCRIPT_DATA/etc.)
Edit:
Another issue is that the element node handler passes start/end tag tokens to parser._processInputToken while in RAWTEXT/RCDATA/etc. state, which is normally impossible, so parse5 just ignores them.
The result of this is:
With the markdown: <xmp>*one* *two*</xmp>, and allowDangerousHtml enabled:
- using
micromarkto output html directly:- result:
<p><xmp><em>one</em> <em>two</em></xmp></p>- displayed as either "<em>one</em> <em>two</em>" OR "one two" (depending on how the html is sanitized)
- result:
- using
mdast-util-from-markdown+mdast-util-to-hast+hast-util-raw+hast-util-to-html:- result:
<p><xmp>one</xmp> <em>two</em></p>- displayed as "one two" (the first <em> token is ignored since we're still in RAWTEXT mode, but then the "one" text node sets the parser back to DATA mode. so </em> closes the <xmp> since there's no open <em>, (and the second em node is handled normally since we're outside xmp now))
- result:
(note: in both cases, the actual result (as displayed by a browser) is <p></p><xmp>... <p></p> because xmp is not allowed inside p, so it splits the paragraph, I think)
Here's the code i used to check the results of that:
import {micromark} from 'micromark'
import {fromMarkdown} from 'mdast-util-from-markdown'
import {toHast} from 'mdast-util-to-hast'
import {raw} from 'hast-util-raw'
import {toHtml} from 'hast-util-to-html'
import {inspect} from 'unist-util-inspect'
let text = `<xmp>*one* *two*</xmp>`
let html = micromark(text, {allowDangerousHtml: true})
let mdast = fromMarkdown(text, {allowDangerousHtml: true})
let hast = toHast(mdast, {allowDangerousHtml:true})
let hast2 = raw(hast)
let html2 = toHtml(hast2)
console.log('=== TEXT ===')
console.log(text)
console.log('=== HTML ===')
console.log(html)
console.log('='.repeat(50))
console.log('=== MDAST ===')
console.log(inspect(mdast, {showPositions:false}))
console.log('=== HAST ===')
console.log(inspect(hast, {showPositions:false}))
console.log('=== HAST (after hast-util-raw) ===')
console.log(inspect(hast2, {showPositions:false}))
console.log('=== HTML ===')
console.log(html2)
So to repeat: you really should not use xmp. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/xmp.
If you start working on a PR, I recommend basing it off of #17. parse5 changed a lot last major, so this project was rewritten for that. I plan on releasing it in one or two months.
Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.