Self-closing tags (such as `<span/>`) are processed incorrectly

Question

Opened this issue 7 months ago · 0 comments

This is a more specific follow-up to #181.

When a self-closing tag (such as `<span/>`) is processed, the output is an incorrectly unclosed tag (such as `<span>`). This causes significant structural issues when the content is read back in.

For example, the following code:

import minify_html

html = """
<p>
  <span>ABC</span>
  <span/>
  <span/>
  <span/>
  <span/>
  <span/>
  <span>DEF</span>
</p>

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>ABC</mi>
  <mi/>
  <mi/>
  <mi/>
  <mi>DEF</mi>
</math>
"""

html_small = minify_html.minify(html, keep_closing_tags=True)
print(html_small)

results in the following HTML (added linefeeds are mine):

<p>
<span>ABC</span>
<span>
<span>
<span>
<span>
<span>
<span>DEF</span>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>ABC</mi>
<mi>
<mi>
<mi>
<mi>DEF</mi>

which is interpreted by a browser (Firefox) as follows:

<p>
  <span>ABC</span>
  <span>
    <span>
      <span>
        <span>
          <span>
            <span>DEF</span>
            <math xmlns="http://www.w3.org/1998/Math/MathML">
              <mi>ABC</mi>
              <mi>
                <mi>
                  <mi>
                    <mi>DEF</mi>
                  </mi>
                </mi>
              </mi>
            </math>
          </span>
        </span>
      </span>
    </span>
  </span>
</p>

Self-closing block-level elements pose a similar issue. While many browsers will force-close adjacent unclosed elements of this kind because they are block-level, many parsers (such as lxml) do not, and a similar cascade of misclosed tags occurs there too.
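
For anyone who wants to check whether their own output is affected, here is a minimal sketch that uses only the standard library's `html.parser` to report how deep the open-tag stack gets and which tags are still open at the end. The names `OpenTagTracker` and `open_tag_report` are hypothetical, and the sketch assumes the XML-style reading used in this issue, where `<span/>` opens and closes immediately. It is not how a browser or lxml builds its tree.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they are not tracked.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class OpenTagTracker(HTMLParser):
    """Tracks the stack of open tags while the document is parsed."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Close everything up to and including the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    # HTMLParser's default handle_startendtag() calls handle_starttag() and
    # then handle_endtag(), so "<span/>" is pushed and popped right away.


def open_tag_report(html):
    """Return (maximum nesting depth, tags still open at end of input)."""
    tracker = OpenTagTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.max_depth, list(tracker.stack)


before = "<p><span>ABC</span><span/><span/><span>DEF</span></p>"
after = "<p><span>ABC</span><span><span><span>DEF</span>"
print(open_tag_report(before))  # (2, [])
print(open_tag_report(after))   # (4, ['p', 'span', 'span'])
```

A non-empty second value, or a depth that grows with the number of formerly self-closing tags, points to the cascade described above.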

We are able to work around it as follows:

import re

html = re.sub(
    r"<([^\s>]+)([^>]*)/>",
    r"<\1\2></\1>",
    html,
    flags=re.DOTALL,
)

but a proper fix would be better (and more efficient, as we process tens of thousands of HTML files at a time). Either self-closing tags should be emitted as self-closing by default (it costs only one more character), or they should be preserved when keep_closing_tags=True (for use with a downstream parser that expects predominantly well-formed HTML).
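
One caveat with the regex workaround above: it also rewrites void elements, so `<br/>` becomes `<br></br>`. Below is a slightly more careful sketch of the same idea that leaves void elements alone and tolerates whitespace before the slash. The function name `expand_self_closing` is hypothetical. It is still a regex, so it would also touch `/>` sequences inside comments or scripts, and it assumes attribute values do not contain `>`.

```python
import re

# HTML void elements never take a closing tag, so they are left untouched.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})

# Tag name, then attributes (lazily), then optional whitespace and "/>".
_SELF_CLOSING = re.compile(r"<([^\s/>]+)([^>]*?)\s*/>")


def expand_self_closing(html: str) -> str:
    """Rewrite <tag ... /> as <tag ...></tag> for non-void elements."""

    def _expand(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            return match.group(0)  # keep <br/>, <img .../>, etc. as written
        return f"<{name}{attrs}></{name}>"

    return _SELF_CLOSING.sub(_expand, html)


print(expand_self_closing('<p><span/><br/><mi /></p>'))
# <p><span></span><br/><mi></mi></p>
```

The expanded string can then be passed to `minify_html.minify(..., keep_closing_tags=True)` as in the example at the top of this issue.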