Handle invalid floating point colspan values gracefully
jerridan opened this issue · 3 comments
I ran into an error while converting HTML that contains invalid floating-point colspan values: markdownify raises an exception when it tries to convert the value to an integer.
Ideally I would fix the offending HTML, but I don't own it. Rather than raising an error, I think markdownify should just round the value to the nearest integer.
Example html:
<tr>
<td colspan="0.75">my colspan</td>
</tr>
Expected behaviour:
returns \n| my colspan\n\n
Actual behaviour:
Throws ValueError: invalid literal for int() with base 10: '0.75'
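A minimal sketch of the rounding behaviour proposed above, not markdownify's actual code. `parse_span` is a hypothetical helper name, and two choices in it are assumptions: the result is clamped to a minimum of 1, and empty or non-numeric values fall back to 1.

```python
import math


def parse_span(value, default=1):
    """Hypothetical helper: coerce a colspan/rowspan attribute to a positive int."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Missing (None), empty ('') or non-numeric values use the default.
        return default
    if not math.isfinite(number):
        # 'nan' and 'inf' parse as floats but cannot be rounded to a span.
        return default
    # Round to the nearest integer, never returning less than one column.
    return max(1, round(number))


print(parse_span("0.75"))
print(parse_span(""))
print(parse_span("3"))
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so "2.5" becomes 2 here. Whether that or rounding half up is preferable is a design choice this sketch leaves open.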
I get the same error with the Wikipedia page for Ford:
Traceback (most recent call last):
  File "html_to_markdown.py", line 24, in <module>
    markdown_content = MarkdownConverter().convert_soup(soup)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 100, in convert_soup
    return self.process_tag(soup, convert_as_inline=False, children_only=True)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 143, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 143, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 143, in process_tag
    text += self.process_tag(el, convert_children_as_inline)
  [Previous line repeated 16 more times]
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 148, in process_tag
    text = convert_fn(node, text, convert_as_inline)
  File "/home/dev/.local/lib/python3.10/site-packages/markdownify/__init__.py", line 381, in convert_td
    colspan = int(el['colspan'])
ValueError: invalid literal for int() with base 10: ''
I ran into this exact error. It turns out that an empty colspan or rowspan value triggers it, for example:
<td colspan>
In my case, I worked around it with a regex replacement that strips any colspan or rowspan not followed by an "=" before converting to Markdown.
import re
html_content = re.sub(r'(colspan|rowspan)(?!=)', '', html_content)
I hope that helps.
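One caveat with the replacement above: it matches the words anywhere in the document, so cell text such as "my colspan" would lose the word too. Below is a rough sketch of a narrower variant that only touches bare attributes inside `<td>` and `<th>` opening tags. `strip_bare_spans` is a hypothetical name, and the sketch assumes no `>` characters and no colspan/rowspan words appear inside quoted attribute values.

```python
import re

# Opening <td ...> / <th ...> tags (assumes no '>' inside attribute values).
_CELL_TAG = re.compile(r'<(?:td|th)\b[^>]*>', re.IGNORECASE)
# A colspan/rowspan attribute name that is not followed by '='.
_BARE_SPAN = re.compile(r'\s(?:colspan|rowspan)\b(?!\s*=)', re.IGNORECASE)


def strip_bare_spans(html):
    """Remove valueless colspan/rowspan attributes from table cell tags."""
    return _CELL_TAG.sub(lambda m: _BARE_SPAN.sub('', m.group(0)), html)


print(strip_bare_spans('<td colspan>my colspan</td>'))
```

This leaves attributes with a value (including an explicitly empty `colspan=""`) untouched, so it covers only the bare-attribute case described in this comment.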