whatwg/infra

Isolated Surrogates

wlammen opened this issue · 4 comments

In https://infra.spec.whatwg.org/#strings you write:

Note: The replaced surrogates are always isolated surrogates, since the process of interpreting the string as containing code points will have converted surrogate pairs into scalar values. ...

This is not correct. A counterexample is a string composed of two surrogates:

U+D800 U+D800

They do not form a valid surrogate pair, as they are both from the high-surrogate range. Unicode requires the first to be a high surrogate and the second a low surrogate. That is not the case here, so they can never represent a Unicode code point. In such a case you most likely want both to be replaced with U+FFFD REPLACEMENT CHARACTER. Obviously, the surrogates are NOT isolated.

Wolf Lammen

That's a sequence of two isolated surrogates, see section 3.8 of Unicode.
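To make that reading concrete, here is a minimal sketch in Python of the two steps the Infra note describes: interpreting code units as code points (which merges valid surrogate pairs), then replacing whatever surrogates remain with U+FFFD. The function names `code_units_to_code_points` and `to_scalar_values` are made up for illustration and are not taken from Infra or Unicode.

```python
def code_units_to_code_points(units):
    """Merge each high surrogate followed by a low surrogate into one code point.

    Any other code unit, including an unpaired surrogate, passes through unchanged.
    (Hypothetical helper, written only for this illustration.)
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low_next = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low_next:
            # Standard UTF-16 pair decoding.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


def to_scalar_values(code_points):
    """Replace every remaining surrogate code point with U+FFFD."""
    return [0xFFFD if 0xD800 <= cp <= 0xDFFF else cp for cp in code_points]


# U+D800 U+D800: neither unit is followed by a low surrogate, so no pair is formed
# and each one is replaced individually.
two_high = to_scalar_values(code_units_to_code_points([0xD800, 0xD800]))

# U+D83D U+DE00 is a valid pair and is merged before the replacement step runs.
valid_pair = to_scalar_values(code_units_to_code_points([0xD83D, 0xDE00]))
```

Under this sketch, both surrogates in the U+D800 U+D800 example survive the pairing step unpaired, so "isolated" here means "not part of a surrogate pair" rather than "separated from other surrogates".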

I haven't found where Unicode defines an Isolated Surrogate, see https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2630 , but it uses this expression in one explanatory paragraph.

I understood the word "isolated" to mean separated from other surrogates by some other code units, which is not the case in my example.

I am not going to insist on this any further. Take this issue as an example of how the Infra Standard can be misinterpreted by someone who is familiar with Unicode (meaning they have once worked through the whole text), but not to the point where each and every non-normative detail written somewhere is still present in memory.

I leave it up to you to decide whether this has a consequence or not.

Since Unicode doesn't seem to define the term, I think it'd be best to clarify, especially since the only uses in Infra are in non-normative text.

I created a PR that makes Infra avoid saying "isolated surrogate", but Unicode does seem to use that terminology for equivalent non-normative explanations, such as in section 2.7. Strangely, section 5.4 does not touch upon it.

I'm somewhat curious if @macchiati or @markusicu would have the time and are willing to add their two cents on this topic.