Self-closing tags (such as `<p/>`) are processed incorrectly
This is a more specific follow-up to #181.
When a self-closing tag (such as `<p/>`) is processed, the output is an incorrectly unclosed tag (such as `<p>`). This causes significant structural issues when the content is read back in.
For example, the following code:
```python
import minify_html

html = """
<p>
<span>ABC</span>
<span/>
<span/>
<span/>
<span/>
<span/>
<span>DEF</span>
</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>ABC</mi>
<mi/>
<mi/>
<mi/>
<mi>DEF</mi>
</math>
"""
html_small = minify_html.minify(html, keep_closing_tags=True)
print(html_small)
```
results in the following HTML (added linefeeds are mine):
```html
<p>
<span>ABC</span>
<span>
<span>
<span>
<span>
<span>
<span>DEF</span>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mi>ABC</mi>
<mi>
<mi>
<mi>
<mi>DEF</mi>
```
which is interpreted by a browser (Firefox) as follows:
```html
<p>
  <span>ABC</span>
  <span>
    <span>
      <span>
        <span>
          <span>
            <span>DEF</span>
            <math xmlns="http://www.w3.org/1998/Math/MathML">
              <mi>ABC</mi>
              <mi>
                <mi>
                  <mi>
                    <mi>DEF</mi>
                  </mi>
                </mi>
              </mi>
            </math>
          </span>
        </span>
      </span>
    </span>
  </span>
</p>
```
Self-closing `<p/>` elements pose a similar issue. While many browsers will force-close adjacent unclosed `<p>` elements due to their block-element-ness, many parsers (such as `lxml`) do not, and a similar cascade of misclosed `<p>` tags occurs there too.
We are able to work around it as follows:
```python
import re

html = re.sub(
    r"<([^\s>]+)([^>]*)/>",
    r"<\1\2></\1>",
    html,
    flags=re.DOTALL,
)
```
but a proper fix would be better (and more efficient, as we process tens of thousands of HTML files at a time). Either self-closing tags should be self-closed by default (it's one more character), or they should be kept when `keep_closing_tags==True` (when working with a downstream parser that expects predominantly well-formed HTML).
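For anyone who needs the workaround in the meantime, here is a slightly more defensive sketch of the same idea. It is only an illustration: the names `VOID_ELEMENTS`, `SELF_CLOSING`, and `expand_self_closing` are made up for this example and are not part of `minify_html`. It assumes attribute values never contain `>` and that no self-closing-looking text appears inside scripts or comments. Unlike the regex above, it leaves HTML void elements such as `<br/>` alone, since expanding those to `<br></br>` would introduce a different problem.

```python
import re

# HTML void elements never take a closing tag, so they must not be expanded.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})

# Group 1: tag name (no whitespace, "/" or ">"); group 2: raw attribute text.
SELF_CLOSING = re.compile(r"<([^\s/>]+)([^>]*?)\s*/>")


def expand_self_closing(html):
    """Rewrite <tag .../> as <tag ...></tag>, except for void elements."""

    def replace(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() in VOID_ELEMENTS:
            # Keep void elements exactly as written.
            return match.group(0)
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSING.sub(replace, html)
```

This would be applied to the input before minifying, e.g. `minify_html.minify(expand_self_closing(html), keep_closing_tags=True)`.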