EvitanRelta/htmlarkdown

Should not use markdown-escaping inside of HTML-syntax

EvitanRelta opened this issue · 2 comments

The problem

Currently, this HTML:

<p align="center">
  &lt;tag&gt;
</p>

converts to:

<p align="center">
  \<tag>
</p>

which incorrectly uses markdown's backslash-escaping, instead of HTML's &lt; escaping.


Edge cases

Most of the time, markdown syntax (including backslash-escaping) doesn't work inside HTML tags.
However, it does work inside tags that are both:

  • inline (eg. text-formatting tags such as <em> / <code>, and <span>)
  • written on a single line in the markdown
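
The two conditions above could be checked with something like the sketch below. The function name and the tag list are illustrative assumptions (only the tags mentioned in this issue are listed), not htmlarkdown's actual code:

```typescript
// Hypothetical helper: guesses whether markdown syntax is still active
// inside an HTML element once it is written out in the markdown.
// The tag list is an assumption and only covers tags named in this issue.
const INLINE_TAGS = new Set(['em', 'code', 'sup', 'span'])

function markdownLikelyActive(tagName: string, outerHtml: string): boolean {
  const isInline = INLINE_TAGS.has(tagName.toLowerCase())
  const isSingleLine = !outerHtml.includes('\n')
  // Both conditions must hold for backslash-escaping to work.
  return isInline && isSingleLine
}
```

When this returns false, the text inside the tag would need HTML escaping (eg. &lt;) rather than a backslash.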

For example, these markdown-syntax containing tags render properly:

<code>\<tag> \&nbsp; **Bold**</code>
<sup>\<tag> \&nbsp;  **Bold**</sup>
<span>\<tag> \&nbsp;  **Bold**</span>

Rendered as:

<tag> &nbsp; Bold
<tag> &nbsp; Bold
<tag> &nbsp; Bold


But when they are broken up across multiple lines, the markdown syntax stops working:

<code>
  \<tag> \&nbsp; **Bold**
</code>
<sup>
  \<tag> \&nbsp;  **Bold**
</sup>
<span>
  \<tag> \&nbsp;  **Bold**
</span>

Rendered as:

\ \  **Bold** \ \  **Bold** \ \  **Bold**

To keep the output as readable as possible, with as little HTML as possible,
I propose escaping a character based on what's around it,
instead of always escaping the usual 5 HTML characters (ie. &, <, >, ", ')
or even the 3 main ones (ie. &, <, >).

For example:

<p forcehtml>&lt;div&gt;</p>
<p forcehtml>I &lt;3 Justin Bieber</p>
<p forcehtml>Cookies & cream</p>
<p forcehtml>Empty ampersand escape: &;</p>

would be converted to:

<p>&lt;div></p>

<p>I <3 Justin Bieber</p>

<p>Cookies & cream</p>

<p>Empty ampersand escape: &;</p>

Which properly renders as:

<div>

I <3 Justin Bieber

Cookies & cream

Empty ampersand escape: &;


Update:

Turns out there are more rules than I thought on which characters must be escaped.
For example, this:

<p>"&amp;#xA": &#xA</p>
<p>"&amp;#<!--GH_ISSUE_AUTOLINK_BUSTER-->3": &#3</p>
<p>"&lt;/&gt;": </></p>
<p>"&lt;?&gt;": <?></p>
<p>"&lt;!&gt;": <!></p>

Renders on GitHub as:

"&#xA":

"&#3": �

"</>":

"<?>":

"<!>":


I've settled on these 2 regexes:

/&(?=#[0-9]|#x\w|\w)/g, which escapes to: "&amp;"

/<(?=[!?/a-z])/gi which escapes to: "&lt;"
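
As a minimal sketch of how the 2 regexes could be applied together (the function and constant names are made up for illustration, not the library's API):

```typescript
// The two patterns quoted above, unchanged.
const AMP_PATTERN = /&(?=#[0-9]|#x\w|\w)/g
const LT_PATTERN = /<(?=[!?/a-z])/gi

// Hypothetical helper name, used here only for illustration.
function escapeConservatively(text: string): string {
  // "&" must be handled first; otherwise the "&" of a freshly
  // inserted "&lt;" would itself be escaped to "&amp;lt;".
  return text.replace(AMP_PATTERN, '&amp;').replace(LT_PATTERN, '&lt;')
}

console.log(escapeConservatively('<div>'))              // &lt;div>
console.log(escapeConservatively('I <3 Justin Bieber')) // I <3 Justin Bieber
console.log(escapeConservatively('Cookies & cream'))    // Cookies & cream
console.log(escapeConservatively('&#xA </> <?> <!>'))   // &amp;#xA &lt;/> &lt;?> &lt;!>
```

The order of the two replacements matters: running the "<" pattern first would produce "&amp;lt;div>" for the first input.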

Then maybe add an option to turn off this "conservative escaping" feature, so that either the 3 main characters (ie. &, <, >) or all 5 characters (ie. &, <, >, ", ') are always escaped.
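
One possible shape for that option, as a rough sketch. The mode names, the function name, and the choice of &quot; / &#39; for the quote characters are all assumptions, not an existing htmlarkdown setting:

```typescript
// Hypothetical option values; the names are placeholders.
type EscapeMode = 'conservative' | 'three' | 'five'

const ENTITY_BY_CHAR: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
}

function escapeHtmlText(text: string, mode: EscapeMode = 'conservative'): string {
  switch (mode) {
    case 'three':
      // Always escape &, < and > in a single pass (no double escaping).
      return text.replace(/[&<>]/g, (ch) => ENTITY_BY_CHAR[ch] ?? ch)
    case 'five':
      // Always escape all 5 characters in a single pass.
      return text.replace(/[&<>"']/g, (ch) => ENTITY_BY_CHAR[ch] ?? ch)
    default:
      // Conservative: escape only where the character would be parsed as HTML.
      return text
        .replace(/&(?=#[0-9]|#x\w|\w)/g, '&amp;')
        .replace(/<(?=[!?/a-z])/gi, '&lt;')
  }
}
```

With the input `I <3 "pie" & 'cake'`, the conservative mode leaves the text untouched, 'three' escapes only the `<` and `&`, and 'five' also escapes both kinds of quotes.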