syntax-tree/hast-util-raw

RAWTEXT/RCDATA elements are closed inconsistently

12Me21 opened this issue · 6 comments

Initial checklist

Affected packages and versions

7.2.3

Link to runnable example

No response

Steps to reproduce

import {raw} from 'hast-util-raw'
import {u} from 'unist-builder'

let test1 = u('root', [
	u('raw', '<xmp><b>bold</b>after</xmp>'),
])

let test2 = u('root', [
	u('raw', '<xmp>'),
	u('raw', '<b>'),
	u('text', 'bold'),
	u('raw', '</b>'),
	u('text', 'after'),
	u('raw', '</xmp>'),
])

console.log(raw(test1))
console.log(raw(test2))

Expected behavior

The output tree should contain:

  • root
    • element: xmp
      • text: "<b>bold</b>after"

(note: <xmp> is parsed in RAWTEXT mode, so tags/escapes aren't parsed inside it, and it only ends when the parser reaches </xmp>)

Actual behavior

  • root
    • element: xmp
      • text: "<b>bold"
    • text: "after"

the <xmp> element is being closed by the </b> tag
This affects all elements whose contents are parsed as RAWTEXT/SCRIPT_DATA (<style>, <iframe>, <noembed>, <noframes>, <script>) or RCDATA (<title>, <textarea>)

This can be seen in remark/rehype (with allowDangerousHtml enabled), in cases like:

<xmp><b>one</b><b>two</b></xmp>

which renders as:

<b>onetwo

Affected runtime and version

node@18.0.0

Affected package manager and version

npm@8.6.0

Affected OS and version

Debian

Build and bundle tools

Rollup

So, ehh, why are you using obsolete elements anyway? It’s marked as “Non-conforming features” by the HTML spec: so humans shouldn’t use it. Of course, you’re right that it would be good to fix.
A more logical reduced test case can probably be made with style, iframe, which also use rawtext (note that xmp does not relate to rcdata as far as I am aware, that’s for textarea, title).

welll, <xmp> is still pretty convenient despite being obsolete, since it's the only "normal" element which is parsed in rawtext mode
anyway, I mentioned rcdata since I believe this bug applies to those elements as well

it seems like this happens because the text node handler calls resetTokenizer(), which always resets the tokenizer into the DATA state, even when inside an element that should be parsed in some other state.
I suppose it would be possible to fix by just not resetting the state? (or only resetting it if it's not DATA/RAWTEXT/RCDATA/SCRIPT_DATA/etc.)
Edit:
Another issue is that the element node handler passes start/end tag tokens to parser._processInputToken while in RAWTEXT/RCDATA/etc. state, which is normally impossible, so parse5 just ignores them.

The result of this is:

With the markdown: <xmp>*one* *two*</xmp>, and allowDangerousHtml enabled:

  • using micromark to output html directly:
    • result: <p><xmp><em>one</em> <em>two</em></xmp></p>
      • displayed as either "<em>one</em> <em>two</em>" OR "one two" (depending on how the html is sanitized)
  • using mdast-util-from-markdown + mdast-util-to-hast + hast-util-raw + hast-util-to-html:
    • result: <p><xmp>one</xmp> <em>two</em></p>
      • displayed as "one two" (the first <em> token is ignored since we're still in RAWTEXT mode, but then the "one" text node sets the parser back to DATA mode. so </em> closes the <xmp> since there's no open <em>, (and the second em node is handled normally since we're outside xmp now))

(note: in both cases, the actual result (as displayed by a browser) is <p></p><xmp>... <p></p> because xmp is not allowed inside p, so it splits the paragraph, I think)

Here's the code i used to check the results of that:

import {micromark} from 'micromark'
import {fromMarkdown} from 'mdast-util-from-markdown'

import {toHast} from 'mdast-util-to-hast'
import {raw} from 'hast-util-raw'
import {toHtml} from 'hast-util-to-html'

import {inspect} from 'unist-util-inspect'

let text = `<xmp>*one* *two*</xmp>`

let html = micromark(text, {allowDangerousHtml: true})

let mdast = fromMarkdown(text, {allowDangerousHtml: true})
let hast = toHast(mdast, {allowDangerousHtml:true})
let hast2 = raw(hast)
let html2 = toHtml(hast2)

console.log('=== TEXT ===')
console.log(text)
console.log('=== HTML ===')
console.log(html)
console.log('='.repeat(50))
console.log('=== MDAST ===')
console.log(inspect(mdast, {showPositions:false}))
console.log('=== HAST ===')
console.log(inspect(hast, {showPositions:false}))
console.log('=== HAST (after hast-util-raw) ===')
console.log(inspect(hast2, {showPositions:false}))
console.log('=== HTML ===')
console.log(html2)

So to repeat: you really should not use xmp. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/xmp.


If you start working on a PR, I recommend basing it off of #17. parse5 changed a lot last major, so this project was rewritten for that. I plan on releasing it in one or two months.

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.