commonmark/commonmark-spec

Character references in autolinks

xiaq opened this issue · 12 comments

xiaq commented

The spec doesn't specify whether character references are supported inside autolinks. The following Markdown:

<aa:&#65;>

is rendered as the following by cmark:

<p><a href="aa:A">aa:A</a></p>

but as the following by commonmark.js:

<p><a href="aa:&amp;#65;">aa:&amp;#65;</a></p>
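
To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not the code of either implementation: `render_autolink` is a hypothetical helper, Python's `html.unescape` stands in for character reference decoding, and percent-encoding of the `href` is left out.

```python
import html


def render_autolink(inner: str, decode_references: bool) -> str:
    """Render the text between < and > of an autolink as an HTML paragraph.

    Hypothetical helper for illustration only; percent-encoding of the
    href is omitted.
    """
    # Optionally resolve character references such as &#65; first.
    target = html.unescape(inner) if decode_references else inner
    # Escape the result for safe use in both the attribute and the text.
    escaped = html.escape(target, quote=True)
    return f'<p><a href="{escaped}">{escaped}</a></p>'


decoded = render_autolink("aa:&#65;", decode_references=True)
literal = render_autolink("aa:&#65;", decode_references=False)
print(decoded)  # same as the cmark output above
print(literal)  # same as the commonmark.js output above
```

With decoding switched on the sketch reproduces the cmark output, and with it switched off it reproduces the commonmark.js output.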
xiaq commented

Ah, I filed an issue about exactly the same problem in commonmark/commonmark.js#263. So it seems that the intention is to support character references inside autolinks.

Maybe we can add an example to the spec with a character reference in an autolink?

I’m pretty strongly in the camp that character references should not work in autolinks.
Except for this, they work in the same spaces where (backslash) character escapes work.
Character escapes are in the same (preliminaries) section in the spec, and that section has an example of one inside an autolink: https://spec.commonmark.org/0.30/#example-20.

I don’t think there should be one edge case where backslashes don’t work but character references do?

jgm commented

I think the motivation was that autolinks can be URLs that you just copy from some other source, and these might contain character references.

I’m not sure about that reasoning: they might just as well be plain Unicode, particularly when coming from an address bar. I could see problems with double decoding.
But, most important for me: it has to be consistent with character escapes.
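
A small sketch of the double decoding concern, again using Python's `html.unescape` as a stand-in for reference decoding. The URL is made up, and it assumes a page whose real URL contains the literal text `&amp;`:

```python
import html

# Hypothetical URL as written in raw HTML source: the literal text
# "&amp;" in the real URL has itself been escaped to "&amp;amp;".
in_html_source = "https://example.com/?q=a&amp;amp;b"

# The browser decodes once; this is what the address bar would show.
in_address_bar = html.unescape(in_html_source)

# If an autolink decodes references again, the URL changes.
after_autolink_decoding = html.unescape(in_address_bar)

print(in_address_bar)           # https://example.com/?q=a&amp;b
print(after_autolink_decoding)  # https://example.com/?q=a&b
```

A URL copied from the address bar has already been decoded once, so decoding it again in the autolink points to a different URL.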

On motivation: do you mean cmark is more in line with your motivation? That the absence in cmjs was because it was forgotten? That the lack of a test for it in the spec was unintended? What do you think about there being a test for character escapes but none for character references?

jgm commented

Yes, in the linked issue, I said I thought that cmark was getting it right.
It could be worth adding a spec example for this.

jgm commented

I see why it would be nice if entities got resolved in exactly the places backslash escapes do -- but again, this is motivated by a desire to support URL copy-pasting.

Consistency with character escapes is most important to me.
If character escapes are allowed too, I am open to it. I still see a lot of inconsistency for character references in Babelmark (so it would be good to specify whatever the choice is).
Here’s a test case of several normal cases and edge cases:

a <https://example&period;com>

b <https:&sol;&sol;example.com>

c <https&colon;//example.com>

d <&#104;ttps://example.com>

e <some&period;user@example.com>

f <some.user@example&period;com>

Note that c and d are not allowed per CommonMark, as the scheme (the part before the :) does not allow &, ;, or #.
And e and f are not allowed per CommonMark because neither the part before the @ (ASCII atext) nor the part after it (the domain) allows ;.
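
The six cases can be checked against the autolink syntax, before any decoding, with a short Python sketch. The two patterns are my transcription of the spec's URI and email autolink rules, not code from cmark or commonmark.js, and `classify` is a hypothetical helper:

```python
import re

# URI autolink: a scheme of 2-32 characters (an ASCII letter, then letters,
# digits, "+", "." or "-"), a colon, then anything except ASCII control
# characters, space, "<" and ">".
URI_AUTOLINK = re.compile(r"[A-Za-z][A-Za-z0-9+.-]{1,31}:[^\x00-\x20\x7f<>]*")

# Email autolink: transcription of the pattern the spec takes from HTML5.
EMAIL_AUTOLINK = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def classify(inner: str) -> str:
    """Say which autolink rule, if any, the text between < and > matches."""
    if URI_AUTOLINK.fullmatch(inner):
        return "uri"
    if EMAIL_AUTOLINK.fullmatch(inner):
        return "email"
    return "none"


CASES = {
    "a": "https://example&period;com",
    "b": "https:&sol;&sol;example.com",
    "c": "https&colon;//example.com",
    "d": "&#104;ttps://example.com",
    "e": "some&period;user@example.com",
    "f": "some.user@example&period;com",
}

results = {label: classify(text) for label, text in CASES.items()}
for label, kind in results.items():
    print(label, kind)
```

Under these patterns only a and b are recognised as autolinks at all, so the question of whether to decode references only arises for them.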

xiaq commented

@jgm IMO there is an equally valid argument against character references if we are talking about copy-pasting: one could also copy-paste from a place that doesn't interpret character references, like the browser's URL bar, or a displayed webpage (as opposed to the HTML source).

jgm commented

@xiaq - granted.

jgm commented

Granting that there are these two possible sources for copy/paste, I think my reasoning was that if a valid character reference occurs in a copied URL, it's by far likeliest that its source is raw HTML rather than the browser's URL bar or a displayed web page. How often does one want to display something like &amp; in a URL?