Checked UCS-2 decode

Question

demurgos opened this issue 7 years ago · 5 comments

Hi,
As you know, UCS-2 and UTF-16 are very similar, but in UCS-2 each 16-bit code-unit directly encodes a code-point from the BMP. Some of these code-points are not assigned to characters because they correspond to the code-units used for surrogate halves in UTF-16. The environment or renderer may then merge these special code-points as if they were UTF-16 code-units in surrogate pairs, but this happens outside of the JS engine (and feels a bit like a hack). UTF-16 does not allow unmatched surrogate halves or reversed surrogate pairs.

ucs2.decode lets you apply this transformation, but it gives no control over how to deal with code-points that correspond to unmatched surrogate halves or reversed surrogate pairs.
When dealing with Unicode, I'd like to detect and possibly react to these situations.

I could roll my own function to check the "UTF-16 validity" of my strings, but I feel that this feature could benefit many people, so I'd like to propose it here.

Concretely, if a string contains unmatched halves or reversed pairs, I propose the following possible behaviors:

1. Ignore the problem and emit the code-point as-is (current behavior).
2. Emit the code-point of the replacement character � (U+FFFD) instead of the original.
3. Throw an error (for example: new Error("Unmatched surrogate half at index 5")).
4. Possibly, skip the code-point. I do not really like this way of silently dealing with errors, but it is a possibility.

For backward compatibility, ucs2.decode(string) should still work and default to the first option (emit as-is).
The other behaviors could be added, either in the same function (ucs2.decode) with an optional second argument (like a string enum with one of "ignore", "replace", "throw", "skip") or in a second function ucs2.checkedDecode where the strategy to deal with the errors is required.
I feel that using an optional argument is cleaner, but maybe you prefer to keep ucs2.decode very minimal (potentially throwing errors might hurt performance) and use a different function that would be allowed to have a higher cost (because the user has to explicitly opt in by using .checkedDecode).
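
To make the proposal concrete, here is a minimal sketch of what such a function could look like. The name checkedDecode and the strategy strings are the ones suggested above; none of this exists in Punycode.js, and the details (default strategy, error types) are assumptions for illustration only.

```javascript
// Hypothetical helper, not part of Punycode.js.
// Decodes a string into an array of code points and applies an explicit
// strategy to lone surrogates (unmatched halves or reversed pairs).
const STRATEGIES = ['ignore', 'replace', 'throw', 'skip'];

function checkedDecode(string, strategy = 'ignore') {
  if (!STRATEGIES.includes(strategy)) {
    throw new TypeError(`Unknown strategy: ${strategy}`);
  }
  const output = [];
  let index = 0;
  while (index < string.length) {
    const unit = string.charCodeAt(index);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && index + 1 < string.length) {
      const next = string.charCodeAt(index + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Well-formed pair: combine both halves into one code point.
        output.push(((unit - 0xD800) << 10) + (next - 0xDC00) + 0x10000);
        index += 2;
        continue;
      }
    }
    if (isHigh || isLow) {
      // Lone surrogate: a reversed pair ends up here twice, once per half.
      if (strategy === 'throw') {
        throw new Error(`Unmatched surrogate half at index ${index}`);
      } else if (strategy === 'replace') {
        output.push(0xFFFD);
      } else if (strategy === 'ignore') {
        output.push(unit);
      }
      // 'skip' pushes nothing.
    } else {
      output.push(unit);
    }
    index += 1;
  }
  return output;
}

console.log(checkedDecode('a\uD83D\uDE00'));       // [ 97, 128512 ]
console.log(checkedDecode('a\uD83Db', 'replace')); // [ 97, 65533, 98 ]
```

With the default strategy the result matches the "emit as-is" behavior described above, so a single function with an optional argument would stay backward compatible.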

This is just a feature proposal; please tell me what you think about it. I can send a PR if there is some interest.

Answer 1 · 2018-01-26T11:55:37.000Z

That's a good idea. I've also written such a function (UTF-16 into Unicode) for my own JavaScript Unicode conversion project at http://roker.spamt.net/codeschwein.html (The page is in German, but I think it is usable even for people who do not speak German ;-))

The UTF-16 decoding is done in DecodeUtf16.convert(), perhaps that code might help you. :-)

Answer 2 · 2018-01-26T18:52:09.000Z

This makes sense as its own project. 👍🏻

It’s not needed for Punycode.js though, so I won’t be making any changes here.

Answer 3 · 2018-01-26T19:08:41.000Z

Thanks for the reply.

I think I'll pull the ucs2 part out of punycode, publish it as its own package, and focus only on this part.
I have use cases where I only used the ucs2 decoder, so it makes sense to have a smaller package.

Answer 4 · 2018-01-27T08:56:46.000Z

@mathiasbynens: What is not needed for Punycode.js? Input validation, so that improper input does not silently produce illegal output?
So I am curious: what is the purpose or a use case of Punycode.js where input validation is not necessary?

Answer 5 · 2018-01-27T12:51:39.000Z

Most browsers let users input valid UCS-2 that is invalid UTF-16. Returning these invalid pairs as-is by default avoids losing any information. If you replace or skip these code units, it is no longer possible to retrieve the original input.
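
As a small illustration of that point, the sketch below compares the two outcomes on a string with a lone surrogate. The helper names toCodePoints and replaceLoneSurrogates are made up for this example and are not part of Punycode.js.

```javascript
// Hypothetical helpers for illustration; neither is part of Punycode.js.
function toCodePoints(string) {
  // String iteration pairs well-formed surrogates and yields lone ones unchanged.
  return Array.from(string, (ch) => ch.codePointAt(0));
}

function replaceLoneSurrogates(codePoints) {
  // After toCodePoints, any value in the surrogate range is a lone surrogate.
  return codePoints.map((cp) => (cp >= 0xD800 && cp <= 0xDFFF ? 0xFFFD : cp));
}

const input = 'a\uD83Db'; // lone high surrogate between two letters
const asIs = toCodePoints(input);
const replaced = replaceLoneSurrogates(asIs);

console.log(String.fromCodePoint(...asIs) === input);     // true
console.log(String.fromCodePoint(...replaced) === input); // false
```

Once U+FFFD has been substituted, nothing in the output says which surrogate was there, which is why the as-is default is the only lossless one.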
The purpose of punycode is to implement the RFCs described in the README and offer helpers to manipulate them. Regarding UCS2, it returns you the array of codepoints as displayed by the browsers: calling it "illegal output" is not fair and depends on the context.
I also agree that it would be better to require developers to opt in to the behavior, but that requires a good understanding of what's happening. I'd have preferred it to be handled in this lib (hence the issue), but the reply above is fine.