wilsonzlin/minify-html

HTML document containing invalid tag </> is truncated when minified

tunniclm opened this issue · 5 comments

When I run @minify-html/node v0.11.1 on website source found in the wild that contains the character sequence </>, the document is cut short after passing through minify().

Example:

> minifyHtml.minify(Buffer.from('<body><p>1</><p>2</body>'), {}).toString()
'<body><p>1'

Compared to the result when the document does not contain </>:

> minifyHtml.minify(Buffer.from('<body><p>1<p>2</body>'), {}).toString()
'<body><p>1<p>2'

This is expected as per the parsing rules. It also matches the WHATWG spec.

wilsonzlin wrote:

This is expected as per the parsing rules. It also matches the WHATWG spec.


Regarding the parsing rules, I can't quite work out which rules apply that would cause </> to end the document.

I see this rule below might apply, meaning the sequence is not considered a closing tag? Although I'm not sure of the implications of that -- would that mean it is considered literal text?

If the character following </ is not a valid tag name character, all code until the next > is dropped. It is not considered a closing tag, even as an invalid one.

And maybe this rule below could also apply, since it's a closing tag that doesn't match anything, so that would mean it should be ignored rather than ending the document?

If a closing tag does not match the opening tag, and the closing tag cannot be omitted as per the spec, the closing tag is ignored. NOTE: Most browsers have far more complex logic.


Regarding the WHATWG spec:

missing-end-tag-name This error occurs if the parser encounters a U+003E (>) code point where an end tag name is expected, i.e., </>. The parser ignores the whole "</>" code point sequence.

Reading the WHATWG spec made me expect that the </> sequence would be ignored, which would mean that <p>2 would not be omitted in the first example - is that incorrect?

@fierydrake You're right, thanks for the clarification. There was a small bug in the parsing logic for malformed closing tags; the fix will be released in the next version. Thanks @tunniclm for raising the issue, and let me know whether it is resolved in the next version.

Regarding the parsing rules mentioned:

  • A malformed closing tag is simply dropped completely, and not interpreted as text or anything else. It's equivalent to it not existing. The rule applies to the example code mentioned in this issue.
  • A closing tag that doesn't match the opening tag is also dropped completely, and also not interpreted as text or anything else. However, that rule is not the one at play here: the tag in this issue is malformed, rather than well-formed but mismatched.
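
For readers following along, the first rule can be illustrated with a small standalone sketch. This is not minify-html's actual parser: `dropMalformedClosingTags` is a hypothetical helper, and it assumes that a "valid tag name character" means an ASCII letter, which may be narrower or wider than what the library accepts.

```javascript
// Hypothetical helper, NOT part of minify-html: a rough illustration of the
// rule "if the character following </ is not a valid tag name character,
// all code until the next > is dropped".
// Assumption: a tag name must start with an ASCII letter.
function dropMalformedClosingTags(html) {
  // Match "</", require that the next character is not an ASCII letter,
  // then consume everything up to and including the next ">".
  return html.replace(/<\/(?![A-Za-z])[^>]*>/g, "");
}

// The input from this issue: "</>" is removed, "</body>" is left alone.
console.log(dropMalformedClosingTags("<body><p>1</><p>2</body>"));
// prints: <body><p>1<p>2</body>
```

Under that assumption the document continues past the malformed tag, so <p>2 survives, which is the behaviour the missing-end-tag-name rule in the WHATWG spec describes for </>.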

This has been fixed as of version 0.11.3, @tunniclm let me know if it works on your end.

@wilsonzlin This is working for me now with the latest version and I can remove my workaround. Thanks!