Image inside SPAN is discarded
Closed this issue · 4 comments
If you run it on the HTML from https://softplan7189029973484399.freshdesk.com/support/solutions/articles/153000199818, you will see that the images in sections "1-" and "2-" are discarded. Not all images are discarded; the common factor is the use of SPAN. When P is used, it works.
@sglebs - this page has invalid HTML: <h3> tags are being used as content containers for entire sections. (You can see this with your browser's "Inspect" feature.) Because Markdownify flattens the text inside heading elements, images in those <h3> tags are lost.
Markdownify requires reasonably well-formed HTML to function. Browsers have very complex heuristics for handling bad HTML, and it is computationally impractical for Markdownify to replicate them.
The underlying HTML parser can tolerate some malformed syntax and structure, but <h3> tags used as containers is not one of them.
As a simple brute-force workaround, the following code unwraps all <h3> tags and allows their contents to be rendered normally:
import bs4
import markdownify
import requests
# get HTML
url = "https://softplan7189029973484399.freshdesk.com/support/solutions/articles/153000199818"
html = requests.get(url, verify=False).text
# read HTML into Beautiful Soup
soup = bs4.BeautifulSoup(html, "lxml")
# unwrap all <h3> tags
for heading in list(soup.find_all("h3")):
    heading.unwrap()
# convert to Markdown
print(markdownify.MarkdownConverter().convert_soup(soup))

@chrispy-snps Yes. This was content from a Freshdesk knowledge base. It is amazing that the tool allows for that kind of bad markup.
Thanks for sharing the tip about unwrap. I guess one would need to do it for h1, h2, h3, h4, h5, h6... Does anything else come to mind?
@sglebs - yes, I was pretty horrified to see that markup. :)
I looked at a few related articles on that site. All seemed to use explicit font sizes and styles for headings; none used real HTML heading tags. If you explore more articles and find other heading tags being used as containers instead of headings, you can pass that list of heading tags to find_all(), like this:
find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

You can also use regex patterns, like this:
find_all(re.compile(r"^h\d$"))

Closing as not planned. @sglebs - if you have follow-up questions, feel free to ask here.