div mis-converted
mirabilos opened this issue · 6 comments
>>> MarkdownConverter().convert('<div>foo</div><div>bar<span>baz</span><span>meow</span></div>')
'foobarbazmeow'
Expected: 'foo \nbarbazmeow'
Looking at the code, this is probably very hard to fix: when processing the second div, the converter would have to look behind and notice that the preceding text does not already end with a \n\n hard paragraph break.
Workaround (this relies on the fix for #92 to be applied):
>>> import re
>>> import bs4
>>> from markdownify import MarkdownConverter
>>> text = '<div>foo</div><div>bar<span>baz</span><span>meow</span></div>'
>>> html = bs4.BeautifulSoup(text, 'html.parser')
>>> for e in html.find_all('div'):
...     e.insert_before(html.new_tag('br'))
...
>>> text = MarkdownConverter().convert_soup(html)
>>> text = re.sub(' \n \n', '\n\n', text)
>>> text = re.sub(' *\n\n+', '\n\n', text).strip()
>>> text
'foo \nbarbazmeow'
Maybe it helps someone.
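In case the two post-processing substitutions look opaque, they behave like this on plain strings (pure re, no bs4 needed; tidy is just an illustrative name), collapsing the ' \n \n' residue that the inserted <br>s leave behind:

```python
import re

def tidy(text):
    # <br> conversion leaves " \n"; two in a row (" \n \n") really
    # mean a paragraph break, so normalise those first, then squeeze
    # any remaining runs of blank lines into a single one.
    text = re.sub(' \n \n', '\n\n', text)
    return re.sub(' *\n\n+', '\n\n', text).strip()
```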
For the sake of completeness, here is my current complete example of how I clean up HTML from RSS feeds before posting it, as Markdown, to the Fediverse (called as cleanup(post.x), where x is title, summary, content, …). It includes a number of workarounds for bad input and for limits of the conversion tools:
import re

import bs4
import bs4.builder
from markdownify import MarkdownConverter

def _cleanup_tablish(tag):
    for e in tag.contents:
        if isinstance(e, bs4.element.NavigableString) and str(e).strip() == '':
            e.extract()
            return True
    return False

def _cleanup_table(top):
    tag = top
    while isinstance(tag, bs4.element.Tag) and \
            tag.name in ('table', 'tbody', 'tr', 'th', 'td'):
        while _cleanup_tablish(tag):
            pass
        have_tablish = False
        have_nontablish = False
        have_elts = 0
        for e in tag.contents:
            if isinstance(e, bs4.element.NavigableString):
                have_nontablish = True
            elif e.name in ('table', 'tbody', 'tr', 'th', 'td'):
                have_tablish = True
            else:
                have_nontablish = True
            have_elts = have_elts + 1
        if have_elts == 0:
            top.extract()
            return
        if have_nontablish:
            if have_tablish:
                # huh? mixed table and non-table children; give up
                return
            tag.name = 'div'
            tag.attrs.clear()
            e = tag.contents[0]
            if have_elts == 1 and isinstance(e, bs4.element.Tag) and \
                    e.name in bs4.builder.HTMLTreeBuilder.block_elements:
                tag = e
            if tag != top:
                top.replace_with(tag)
            return
        if have_elts > 1:
            return
        tag = tag.contents[0]
_cleanup_traildots = re.compile('\\.\\.\\.$')
def cleanup(text):
    text = re.sub('\r+\n?', '\n', text)
    html = bs4.BeautifulSoup(text, 'html.parser', multi_valued_attributes=None)
    # remove <!-- comments -->
    for e in html.find_all(string=lambda e: isinstance(e, bs4.element.Comment)):
        e.extract()
    # flatten tables with only one cell (Goodreads)
    for e in html.find_all('table'):
        _cleanup_table(e)
    # expand shortened links
    for e in html.find_all('a', href=True, string=_cleanup_traildots):
        href = str(e['href'])
        if href.startswith(str(e.string).rstrip('.')):
            e.string.replace_with(href)
    # temporarily move <pre>s aside
    pres = []
    npres = 0
    for pre in html.find_all('pre'):
        pres.append(pre.replace_with(html.new_tag('rpre', num=npres)))
        npres = npres + 1
    # clean whitespace except in the extracted <pre>s
    text = str(html)
    text = re.sub(' *\n *', '\n', text)
    text = text.replace('\n', '\1')
    text = re.sub('\1\1\1+', '\n\n', text)
    text = re.sub('\1+ *', ' ', text).strip()
    text = re.sub('[\t ]+', ' ', text)
    # bring back the extracted <pre>s
    html = bs4.BeautifulSoup(text, 'html.parser')
    for pre in html.find_all('rpre'):
        pre.replace_with(pres[int(pre.attrs['num'])])
    # work around https://github.com/matthewwithanm/python-markdownify/issues/107
    for e in html.find_all('div'):
        e.insert_before(html.new_tag('br'))
    # convert and clean up
    text = MarkdownConverter(strip=['img']).convert_soup(html)
    text = re.sub(' \n \n', '\n\n', '\n' + text + '\n')
    text = re.sub('(\n> )+\n', '\n> \n', '\n' + text + '\n')
    text = re.sub(' *\n\n+', '\n\n', text)
    return text.strip()
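A side note on the whitespace pass above: the '\1' literals are chr(1), a control-character placeholder that should not occur in sane input. Newlines are swapped for it so that runs of three or more (i.e. two or more consecutive line breaks enclosing blank lines, a paragraph break) can be told apart from single line breaks, which then collapse to spaces. A stand-alone sketch of that trick, pure stdlib (squeeze_blank_lines is an illustrative name):

```python
import re

def squeeze_blank_lines(text):
    text = re.sub(' *\n *', '\n', text)        # trim space around newlines
    text = text.replace('\n', '\x01')          # hide newlines behind chr(1)
    text = re.sub('\x01\x01\x01+', '\n\n', text)  # 3+ breaks = paragraph
    text = re.sub('\x01+ *', ' ', text).strip()   # lone breaks become spaces
    return re.sub('[\t ]+', ' ', text)            # squeeze remaining runs
```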
We're hitting this too. It is a difficult fix in the current architecture.
Thanks for reporting this! It is indeed quite hard to fix. In particular, we would have to decide whether divs always behave as paragraphs. If they do, we could simply handle div the same as p and this would be fixed, but I think that would break other cases. I'm open to suggestions, though.
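A cheap way to experiment with the "div behaves like p" idea from user code, without patching markdownify, is to rename the tags before converting. Whether this breaks other inputs is exactly the open question, so treat this as a sketch (divs_as_paragraphs is a made-up helper name; requires beautifulsoup4):

```python
import bs4

def divs_as_paragraphs(html_text):
    # Rename every <div> to <p> so the converter's existing paragraph
    # handling (blank line between blocks) applies to them as well.
    soup = bs4.BeautifulSoup(html_text, 'html.parser')
    for div in soup.find_all('div'):
        div.name = 'p'
    return soup  # feed this to MarkdownConverter().convert_soup(...)
```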