matthewwithanm/python-markdownify

Handle invalid floating point colspan values gracefully

jerridan opened this issue · 3 comments

I ran into an issue while converting HTML that contained invalid floating-point colspan values. markdownify raises an error when it tries to convert such a value to an integer.

Ideally I would fix the offending HTML, but I don't own it. Rather than raising an error, I think markdownify should just round to the nearest integer.

Example HTML:

<tr>
  <td colspan="0.75">my colspan</td>
</tr>

Expected behaviour:
returns \n| my colspan\n\n

Actual behaviour:
Throws ValueError: invalid literal for int() with base 10: '0.75'
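
For illustration, here is a minimal sketch of the tolerant parsing proposed above. The name `parse_span` and its `default` parameter are hypothetical and not part of markdownify; it also assumes that a span below 1, or a value that is not a number at all, should fall back to 1. It is not a description of how the library handles this.

```python
import math


def parse_span(value, default=1):
    """Return a colspan/rowspan attribute value as an int >= 1.

    Hypothetical helper, not part of markdownify.
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        # None, '' and non-numeric strings such as 'abc' end up here.
        return default
    if not math.isfinite(number):
        # 'nan' and 'inf' parse as floats but are not usable spans.
        return default
    # round() rounds halves to the nearest even integer; clamp so that
    # values such as '0.4' or '-3' still yield a span of at least 1.
    return max(1, round(number))


print(parse_span("0.75"))
print(parse_span("3"))
print(parse_span(""))
```

With such a helper, a call like `int(el['colspan'])` could become `parse_span(el.get('colspan'))`, assuming `el` is a BeautifulSoup tag.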

I get the same error with the Wikipedia page for Ford:

Traceback (most recent call last):
  File "html_to_markdown.py", line 24, in <module>
    markdown_content  = MarkdownConverter().convert_soup(soup)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 100, in convert_soup
    return self.process_tag(soup, convert_as_inline=False, children_only=True)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 143, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 143, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 143, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  [Previous line repeated 16 more times]
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 148, in process_tag
    text = convert_fn(node, text, convert_as_inline)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 381, in convert_td
    colspan = int(el['colspan'])
ValueError: invalid literal for int() with base 10: ''

@LeMoussel

ValueError: invalid literal for int() with base 10: ''

I ran into this exact error. It turns out that an empty colspan or rowspan value causes it, for example:
<td colspan>
In my case, I resolved it with a regex replacement that strips any colspan or rowspan not followed by "=" before converting to markdown:
html_content = re.sub(r'(colspan|rowspan)(?!=)', '', html_content)
I hope that helps.
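
A slightly tightened variant of that workaround, as a sketch only: the helper name `strip_empty_spans` is made up, and the pattern assumes the attribute is preceded by whitespace. It also removes attributes with an empty quoted value such as `colspan=""`, and leaves `colspan = "2"` (spaces around the "=") alone. It is still a plain text substitution, so it would also remove the word "colspan" or "rowspan" when it follows whitespace in ordinary text content.

```python
import re

# Matches a bare colspan/rowspan attribute (no "=value"), or one whose
# quoted value is empty, but not an attribute that has a real value.
_EMPTY_SPAN = re.compile(
    r"""\s+(?:colspan|rowspan)\b(?:\s*=\s*(?:""|''))?(?!\s*=)""",
    flags=re.IGNORECASE,
)


def strip_empty_spans(html_content):
    """Remove empty colspan/rowspan attributes from an HTML string."""
    return _EMPTY_SPAN.sub("", html_content)


print(strip_empty_spans("<td colspan>x</td>"))
print(strip_empty_spans('<td colspan="2">x</td>'))
```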

Thanks for reporting this! Duplicate of #126, closed with 2ec3338