Inconsistent normalized values for some tags

Question

vvanpo opened this issue 4 years ago · 4 comments

Example with Taiwanese:

$ node -e "console.log(require('bcp-47-normalize')('zh-Hans-TW'))"
zh-TW
$ node -e "console.log(require('bcp-47-normalize')('zh-TW'))"
zh-Hant

So if I understand correctly what this program is supposed to do, it's telling me that zh-TW is both the normal form of a tag that includes the 'Hans' script and a tag that is itself "further normalized" to the 'Hant' script?

Answer 1 · 2020-07-24T10:34:52.000Z

Seems like a bug. See here and here for the data. The last link includes zh_Hans, which seems to be why zh-Hans-TW incorrectly goes to zh-TW.

I wonder whether zh-Hans-TW should go to zh-Hans, but I can't quickly see any data that suggests so.

Answer 2 · 2020-07-24T13:10:23.000Z

Thanks for reporting, @vvanpo, released in 1.1.0!

Answer 3 · 2020-08-07T04:53:09.000Z

@wooorm This fix results in zh-CN becoming zh, along with lots of other normalization changes. Is there a reason for this? Should it be marked as a BREAKING CHANGE instead?

https://npm.runkit.com/bcp-47-normalize

var bcp47Normalize = require("bcp-47-normalize")

console.log(bcp47Normalize('zh-CN'));
console.log(bcp47Normalize('zh-TW'));
console.log(bcp47Normalize('zh-MO'));
console.log(bcp47Normalize('zh-HK'));

"zh"
"zh-Hant"
"zh-Hant-MO"
"zh-Hant-HK"

Answer 4 · 2020-08-07T06:48:36.000Z

Yup, that’s the goal of normalizing. For Chinese as spoken in China, the "as spoken in China" part is implied, so it is dropped.

These four all go through here: https://github.com/unicode-org/cldr/blob/4b1225ead2ca9bc7a969a271b9931f137040d2bf/common/supplemental/supplementalMetadata.xml#L177

And then a couple of them are defaults: https://github.com/unicode-org/cldr/blob/4b1225ead2ca9bc7a969a271b9931f137040d2bf/common/supplemental/supplementalMetadata.xml#L1539

I’d normally consider it breaking, but the previous behavior was broken.
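
To make the two steps above concrete, here is a minimal sketch of how the four outputs could come about: first expand a legacy alias, then strip trailing subtags that are marked as default content. The tables are hand-written for illustration and cover only the tags in this thread, and `normalizeSketch` is a hypothetical name. This is not the package's actual implementation or its full CLDR data.

```javascript
// Hypothetical, hand-written tables for illustration only.
// The real data lives in CLDR's supplementalMetadata.xml (linked above).

// Step 1: legacy aliases that expand a short tag to its full form.
const legacyAliases = {
  'zh-CN': 'zh-Hans-CN',
  'zh-TW': 'zh-Hant-TW',
  'zh-MO': 'zh-Hant-MO',
  'zh-HK': 'zh-Hant-HK'
}

// Step 2: tags whose last subtag is implied by their parent tag.
const defaultContent = new Set(['zh-Hans', 'zh-Hans-CN', 'zh-Hant-TW'])

function normalizeSketch(tag) {
  // Expand the alias if there is one, otherwise keep the tag as given.
  let result = legacyAliases[tag] || tag

  // Drop the last subtag for as long as the tag is default content.
  while (defaultContent.has(result)) {
    result = result.slice(0, result.lastIndexOf('-'))
  }

  return result
}

for (const tag of ['zh-CN', 'zh-TW', 'zh-MO', 'zh-HK']) {
  console.log(tag, '->', normalizeSketch(tag))
}
```

Under these assumed tables, zh-CN collapses all the way to zh because both zh-Hans-CN and zh-Hans are defaults, while zh-TW stops at zh-Hant because Hant is not the default script for zh. zh-MO and zh-HK keep their region because neither is the default region for zh-Hant.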