wilsonzlin/minify-html

Self-closing tags (such as `<p/>`) are processed incorrectly


This is a more specific follow-up to #181.

When a self-closing tag such as `<p/>` is processed, the output contains an unclosed opening tag (`<p>`) instead. This causes significant structural issues when the minified content is parsed again.

For example, the following code:

import minify_html

html = """
<p>
  <span>ABC</span>
  <span/>
  <span/>
  <span/>
  <span/>
  <span/>
  <span>DEF</span>
</p>

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>ABC</mi>
  <mi/>
  <mi/>
  <mi/>
  <mi>DEF</mi>
</math>
"""

html_small = minify_html.minify(html, keep_closing_tags=True)
print(html_small)

results in the following HTML (added linefeeds are mine):

<p>
<span>ABC</span>
<span>
<span>
<span>
<span>
<span>
<span>DEF</span>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>ABC</mi>
<mi>
<mi>
<mi>
<mi>DEF</mi>

which is interpreted by a browser (Firefox) as follows:

<p>
  <span>ABC</span>
  <span>
    <span>
      <span>
        <span>
          <span>
            <span>DEF</span>
            <math xmlns="http://www.w3.org/1998/Math/MathML">
              <mi>ABC</mi>
              <mi>
                <mi>
                  <mi>
                    <mi>DEF</mi>
                  </mi>
                </mi>
              </mi>
            </math>
          </span>
        </span>
      </span>
    </span>
  </span>
</p>
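
The cascade can also be checked without a browser. The following is a rough sketch using only Python's standard-library `html.parser`; the `DepthTracker` class and `max_depth` helper are names made up for this illustration and are not part of minify-html. It counts tags naively and, unlike a browser, applies no implicit-closing rules, so it only approximates the nesting shown above.

```python
from html.parser import HTMLParser


class DepthTracker(HTMLParser):
    """Hypothetical helper: tracks how deeply start tags nest (naive count)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        # Every start tag opens one more nesting level.
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # Every end tag closes one level (never below zero).
        self.depth = max(0, self.depth - 1)

    def handle_startendtag(self, tag, attrs):
        # A self-closing tag such as <span/> opens and closes immediately.
        self.max_depth = max(self.max_depth, self.depth + 1)


def max_depth(markup):
    """Return the deepest nesting level seen while tokenizing `markup`."""
    tracker = DepthTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.max_depth


before = "<p><span>ABC</span><span/><span/><span>DEF</span></p>"
after = "<p><span>ABC</span><span><span><span>DEF</span>"
print(max_depth(before), max_depth(after))
```

A jump in the maximum depth between the input and the minified output is a cheap signal that self-closing tags were turned into unclosed ones.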

Self-closing `<p/>` elements pose a similar issue. Browsers implicitly close an open `<p>` element when they encounter the next `<p>` start tag, but many parsers (such as lxml) do not, so a similar cascade of misnested `<p>` elements occurs there too.

We are able to work around it as follows:

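import re

# Expand every self-closing tag into an explicit open/close pair,
# e.g. <span/> becomes <span></span>.
# Group 1 is the tag name, group 2 is any attribute text.
# (re.DOTALL is not needed: the pattern contains no "." to modify.)
html = re.sub(
    r"<([^\s>]+)([^>]*)/>",
    r"<\1\2></\1>",
    html,
)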

but a proper fix would be better (and more efficient, as we process tens of thousands of HTML files at a time). Either self-closing tags should be emitted as self-closed by default (it costs only one more character), or they should be preserved when `keep_closing_tags=True` is set, which is the case when working with a downstream parser that expects predominantly well-formed HTML.
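
One caveat with the regex workaround above: it also rewrites void elements, so `<br/>` becomes `<br></br>`. A slightly more careful variant might look like the sketch below. The `expand_self_closing` function and the `VOID_ELEMENTS` set are names invented for this example, and the set is assumed to match the void elements listed in the HTML standard. It is still a regex and not a parser, so attribute values containing `>` and markup inside comments are not handled.

```python
import re

# Assumption: the void elements from the HTML standard. These never take a
# closing tag, so they are left untouched.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Group 1: tag name. Group 2: attribute text without trailing whitespace.
_SELF_CLOSING = re.compile(r"<([^\s/>]+)([^>]*?)\s*/>")


def expand_self_closing(html):
    """Rewrite <tag .../> as <tag ...></tag>, skipping void elements."""

    def replace(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # keep <br/>, <img .../> and so on as-is
        return f"<{name}{attrs}></{name}>"

    return _SELF_CLOSING.sub(replace, html)


print(expand_self_closing("<p><span/><br/><mi mathvariant='bold' /></p>"))
```

Compiling the pattern once at module level avoids repeated pattern lookups when the function is called for a large batch of files.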