thephpleague/html-to-markdown

&amp;, &lt; and &gt;

jean-gui opened this issue · 4 comments

Is there a reason why &amp;, &lt; and &gt; are not converted to &, < and >?

use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter();

$html = "&amp; &lt; &gt; &quot; &apos;";
$markdown = $converter->convert($html);

echo $markdown;

Expected result:
& < >

Actual result:
&amp; &lt; &gt; " '

It's to ensure that converting the resulting Markdown back into HTML gives consistent results. Take the following HTML for example:

<p>&gt; test</p>

<p>&amp;#123;</p>

<p>&lt;pre&gt; test &lt;/pre&gt;</p>

If we didn't encode them, you'd end up with this Markdown:

> test

&#123;

<pre> test </pre>

Which, if converted back into HTML, would result in:

<blockquote><p>test</p></blockquote>

{

<pre> test </pre>

Which does not match the original HTML.

Where possible, this library tries to produce Markdown which, if run through league/commonmark, would convert back to HTML that is as close to the original input as possible.

Thanks for your response, I understand the rationale.
Is there a way to change that behavior through config options? I'm using it with Symfony Mailer to produce the text version of emails (see https://symfony.com/doc/4.4/mailer.html#text-content), so converting back to HTML is not useful in this specific use case. If not, I guess there should be a way for me to str_replace those entities.

There's no built-in config option for that, but since we convert them using htmlspecialchars(), it shouldn't be too hard to write a little bit of code to convert those back once you get the Markdown back from this library.
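For the plain-text email use case, a minimal sketch of that post-processing step could look like the following. The helper name markdownToPlainText() is hypothetical and not part of this library, and the sample string stands in for the value returned by $converter->convert($html). It relies on PHP's built-in htmlspecialchars_decode(), which reverses htmlspecialchars() in a single pass.

```php
<?php

/**
 * Hypothetical helper (not part of league/html-to-markdown):
 * decodes the entities produced by htmlspecialchars() so the
 * Markdown can be used as plain text.
 */
function markdownToPlainText(string $markdown): string
{
    // ENT_QUOTES also decodes &quot; and &#039;; ENT_HTML5 adds &apos;.
    return htmlspecialchars_decode($markdown, ENT_QUOTES | ENT_HTML5);
}

// Stand-in for the output of $converter->convert($html).
$markdown = '&amp; &lt; &gt; " \'';

echo markdownToPlainText($markdown), PHP_EOL;
```

Because the decoding is a single pass, text that was literally "&lt;" in the rendered HTML (and so arrives as "&amp;lt;" in the Markdown) comes out as "&lt;" rather than "<". Only do this when the Markdown will never be rendered back to HTML, since the decoded output can contain characters that Markdown treats as syntax.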

OK, thanks for the info. I'm closing this issue.