Links with underscores will be escaped
dragonito opened this issue · 6 comments
Hi there,
got a problem with a special kind of urls looking like
https://example.de/this_is_a_nice_file.ext
it will be converted to:
https://example.de/this\_is\_a\_nice\_file.ext
Perhaps you got an idea to fix it?
Cu Robin
Would you be able to share the exact HTML that causes this behavior? That will help us to replicate the issue and come up with a fix :)
Its really simple:
<p>https://www.dragonito.net/test_hallo_welt.pdf<br/></p>
will be
https://www.dragonito.net/test\_hallo\_welt.pdf
I use some options:
$options = [ 'strip_tags' => true, 'header_style' => 'atx', ];
hope it helps to find the problem, thanks a lot :D
The problem is occurring here:
html-to-markdown/src/Converter/TextConverter.php
Lines 22 to 24 in 7bd4aa5
This logic is unaware of the complex rules that CommonMark has for determining whether underscore characters within words would be treated as emphasis or not. Unfortunately, this isn't going to be easy to fix in a reliable way without porting over all of that delimiter parsing functionality, which is very low on my list of open-source priorities right now.
I'd be open to any ideas for easier fixes that fix this particular case without breaking our existing tests.
A potential naive solution might be breaking this regex replacement into three steps:
- Escape any
\
characters first (to prevent double-escaping later) - Escape any
_
characters that do not appear within words (without accounting for all of CommonMark's edge cases) - Escape the other characters as usual
It wouldn't be perfect but it would be much closer to what you'd expect.
hi @colinodell i played a little bit, added a test and check if the plain text is an url, if its an url I ignore _ escaping. Perhaps not the best solution, but a possible inspiration for fixing it
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.