thephpleague/html-to-markdown

Links with underscores will be escaped

dragonito opened this issue · 6 comments

Hi there,

got a problem with a special kind of urls looking like

https://example.de/this_is_a_nice_file.ext

it will be converted to:

https://example.de/this\_is\_a\_nice\_file.ext

Perhaps you got an idea to fix it?

Cu Robin

Would you be able to share the exact HTML that causes this behavior? That will help us to replicate the issue and come up with a fix :)

Its really simple:

<p>https://www.dragonito.net/test_hallo_welt.pdf<br/></p>

will be

https://www.dragonito.net/test\_hallo\_welt.pdf

I use some options:

$options = [ 'strip_tags' => true, 'header_style' => 'atx', ];

hope it helps to find the problem, thanks a lot :D

The problem is occurring here:

// Escape the following characters: '*', '_', '[', ']' and '\'
if (($parent = $element->getParent()) && $parent->getTagName() !== 'div') {
$markdown = \preg_replace('~([*_\\[\\]\\\\])~u', '\\\\$1', $markdown);

This logic is unaware of the complex rules that CommonMark has for determining whether underscore characters within words would be treated as emphasis or not. Unfortunately, this isn't going to be easy to fix in a reliable way without porting over all of that delimiter parsing functionality, which is very low on my list of open-source priorities right now.

I'd be open to any ideas for easier fixes that fix this particular case without breaking our existing tests.

A potential naive solution might be breaking this regex replacement into three steps:

  1. Escape any \ characters first (to prevent double-escaping later)
  2. Escape any _ characters that do not appear within words (without accounting for all of CommonMark's edge cases)
  3. Escape the other characters as usual

It wouldn't be perfect but it would be much closer to what you'd expect.

hi @colinodell i played a little bit, added a test and check if the plain text is an url, if its an url I ignore _ escaping. Perhaps not the best solution, but a possible inspiration for fixing it

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.