commonmark/commonmark-spec

Numeric character references: Should HTML spec be followed for codes mapping to control characters


The spec says this:

Decimal numeric character references consist of &# + a string of 1--7 arabic digits + ;.
A numeric character reference is parsed as the corresponding Unicode character.
Invalid Unicode code points will be replaced by the REPLACEMENT CHARACTER (U+FFFD).
For security reasons, the code point U+0000 will also be replaced by U+FFFD.

But note that commonmark.js doesn't actually follow that. Take &#150; for example: 150 in hex is 96, so it would map to U+0096, which is the START OF GUARDED AREA (SPA) control character. However, it is converted to an en-dash (–), see dingus.

The reason for this is that commonmark.js uses the entities package, which implements the HTML spec. The HTML spec has a replacement table for certain characters; see the parsing section:

If the number is one of the numbers in the first column of the following table, then find the row with that number in the first column, and set the character reference code to the number in the second column of that row.

| Number | Code point |
| --- | --- |
| ... | ... |
| 0x96 | 0x2013 EN DASH (–) |
| ... | ... |
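
For reference, a minimal sketch of what that remapping amounts to (an illustration only, not the actual code of commonmark.js or the entities package; the table is truncated to the entry relevant here):

```ts
// Hypothetical decoder for decimal numeric character references that follows
// the HTML spec's replacement table. Only the 0x96 entry is shown; the real
// table covers most of the 0x80-0x9F range.
const htmlReplacementTable: Record<number, number> = {
  0x96: 0x2013, // EN DASH (–)
  // ...the remaining entries from the HTML spec would go here...
};

function decodeDecimalReference(digits: string): string {
  let code = parseInt(digits, 10);
  if (code in htmlReplacementTable) {
    code = htmlReplacementTable[code];
  } else if (
    code === 0 || // U+0000, replaced for security reasons
    code > 0x10ffff || // outside the Unicode code space
    (code >= 0xd800 && code <= 0xdfff) // surrogates
  ) {
    code = 0xfffd; // REPLACEMENT CHARACTER
  }
  return String.fromCodePoint(code);
}

decodeDecimalReference("150"); // "–" (U+2013), matching what the dingus shows
```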

(cmark behaves differently yet again, for certain code points, it uses a replacement character instead.)

Question: Should the spec prescribe the same mapping as the HTML spec? Or should it mention it and make it optional whether an implementation uses the replacement table or not?

(Originally raised here: commonmark/commonmark-java#307 (comment))

Well, first of all, I don’t think the user should do that. No need to encode an en-dash in markdown. And if you do, use &ndash; or &#x2013; or a literal –. It’s been a Unicode world for a while now!

CM doesn’t follow all of HTML’s character reference stuff. It doesn’t allow &amp without the ;, for example.
As I understand it, the behavior in HTML of mapping those C1 controls to an en-dash and similar has to do with the fact that their first rule is to not support such controls, and then, instead of using a replacement character, they chose to do something perhaps more useful: use the Windows-1252 character. See also: https://github.com/wooorm/character-reference-invalid#what-is-this.

I’d personally be open to doing the same as HTML here. It would mean that every markdown parser has to embed a list though: https://github.com/wooorm/character-reference-invalid/blob/main/index.js. Is it really an improvement to the world when markdown adds workarounds for Windows 1252?
I’d rather instead use a replacement character. I understand the current spec as saying that. Along these lines:
https://github.com/micromark/micromark/blob/8b08d16891f4d6801eb0c0e990bd7b87836c0826/packages/micromark-util-decode-numeric-character-reference/dev/index.js#L20-L39.
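
Paraphrased (this is not the micromark source itself, just a sketch of that approach), the idea is that anything problematic, including the C1 controls, collapses to U+FFFD instead of being remapped:

```ts
// Sketch of the replacement-character approach: no Windows-1252 table;
// prohibited values become U+FFFD.
function decodeNumericReference(digits: string, base: 10 | 16): string {
  const code = parseInt(digits, base);
  const prohibited =
    code === 0 || // U+0000
    code > 0x10ffff || // outside the Unicode code space
    (code >= 0xd800 && code <= 0xdfff) || // surrogates
    (code >= 0x7f && code <= 0x9f) || // DEL and the C1 controls
    (code < 0x20 && code !== 0x09 && code !== 0x0a && code !== 0x0d); // C0 except TAB, LF, CR
  return String.fromCodePoint(prohibited ? 0xfffd : code);
}

decodeNumericReference("150", 10); // "\uFFFD" rather than an en-dash
```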

jgm commented

I’d rather instead use a replacement character. I understand the current spec as saying that.

Well, the spec says to do this for invalid code points. But U+0096 isn't an invalid code point, is it?

(cmark behaves differently yet again, for certain code points, it uses a replacement character instead.)

These code points are the surrogates; they can't be represented on their own in Unicode text or in any of the Unicode transformation formats (e.g. you can't serialize or deserialize a surrogate to/from a valid UTF-8 byte stream). So specifying one as a character reference makes no sense, and given that CommonMark never errors, the only sensible thing to do here is to replace them by the Unicode replacement character. HTML does the same.
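
(As a quick illustration, assuming a JavaScript runtime with TextEncoder: a lone surrogate does not survive the trip through UTF-8; the encoder substitutes U+FFFD.)

```ts
// A lone surrogate cannot be encoded as UTF-8; TextEncoder emits U+FFFD instead.
const bytes = new TextEncoder().encode("\uD800"); // unpaired high surrogate
console.log([...bytes].map((b) => b.toString(16))); // [ "ef", "bf", "bd" ] = U+FFFD
```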

When I opened #369 I already noticed here that this bit of the spec may benefit from some clarification as to which code points exactly you are allowed to specify via numeric character references.

Regarding the issue at hand I don't have a strong opinion, but I suspect the HTML definition is rather driven by some kind of "no existing document should be broken" constraint, and it would feel slightly odd to replicate that behaviour in CommonMark.

Also I would find it strange to do it for numeric character references but not for literal (raw) occurrences of such characters. But then if we do this for the literals as well we might lose the ability to directly embed some Unicode text based formats/sources in code fences. I tend to find it a good property that CommonMark lets any of the oddest Unicode scalar values creep in (except U+0000), at least in those blocks.

I’d rather instead use a replacement character. I understand the current spec as saying that.

Well, the spec says to do this for invalid code points. But U+0096 isn't an invalid code point, is it?

I'm not too well-versed on this (and I wouldn't use the term "invalid" myself), but I feel/understand that Unicode keeps this space "empty" for compatibility reasons with other encodings. And that HTML only assigns things to it because they'd be "invalid" otherwise.
My understanding may be flawed though!

Basically these "characters" are there because Unicode, to fulfill its duty, decided to project every other existing character encoding into itself. In particular, the code points U+0080 to U+00FF represent the corresponding ISO-8859-1 (latin1) codes. These include U+0080-U+009F, the so-called C1 control codes. So the space is not really empty; its semantics are well defined (see the chart here).

I'm not sure if any format actually uses these codes with their purported semantics though. So one thing that CommonMark could do is to replace all the C0 and C1 control characters (the Unicode general category Cc) except TAB, CR and LF by the Unicode replacement character, because Cc characters tend to be forbidden in text-based formats (see the sketch at the end of this comment).

But then sometimes they are not. For example XML is rather liberal, and JSON, while it mandates escaping of the C0 control characters, does not care about having raw C1 characters in its strings.

So if most Cc characters were to be replaced by the Unicode replacement character in CommonMark, code blocks could then no longer represent every JSON document out there (though they could still represent the majority of useful JSON documents :-). And if literal (raw) C1 characters were to be replaced according to the HTML definitions mentioned above, it would get extremely confusing for those JSON documents in code blocks that have them unescaped in their strings.
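
For concreteness, the "replace every Cc except TAB, CR and LF" idea above would amount to something like the following; this is only an illustration of the proposal, not anything cmark or commonmark.js does today:

```ts
// Illustration only: substitute U+FFFD for C0/C1 controls (Unicode general
// category Cc), keeping TAB, LF and CR.
function scrubControlCharacters(input: string): string {
  // \p{Cc} matches the "Control" general category (requires the u flag).
  return input.replace(/(?![\t\n\r])\p{Cc}/gu, "\uFFFD");
}

scrubControlCharacters("a\u0096b"); // "a\uFFFDb"
scrubControlCharacters("a\tb\r\n"); // unchanged
```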

Code blocks won’t be affected as this issue is about numeric character references

Code blocks won’t be affected as this issue is about numeric character references

Depends on whether you also apply the transform on these code points (whether replacement character or surprising HTML behaviour) to the literal (raw) characters that bear the same code point. This should be done in my opinion, otherwise the character escape system becomes surprising and idiosyncratic for no good reason.

(That being said personally I'm rather convinced that the good answer to the title of this issue is: no).

Hmm, I guess that could be done, but to me it seems like it would be a different issue? There’s no such text for that yet in the markdown spec or in HTML:

```js
document.body.textContent = 'a\u0096b';
document.body.textContent.codePointAt(1).toString(16); // 96
```

Hmm, I guess that could be done, but to me it seems like it would be a different issue?

I don't see a different issue. If you are going to solve this issue by deciding to follow the HTML character mapping only on numeric character references, it's good to also reflect on what your character escape system has effectively become for end users (an inconsistent mess if you ask me).

(cmark behaves differently yet again, for certain code points, it uses a replacement character instead.)

These code points are the surrogates

Ah yes, scratch that part (I wasn't reading the code properly; I do know what surrogates are). GitHub does replace &#150; with a replacement character, see here (but not sure where that happens, might be after cmark): �

I agree that adding the table adds some complexity and I'm not sure it's worth it for supporting some legacy way of specifying some characters. (I think I'll close the commonmark-java issue as "won't fix". Can always reopen in case the spec changes.)

Oh interesting, there was just a ping on my 5-year-old #614, which is very much related.