[EUC-JP] U+4FFF (俿) is encoded as an IBM extended character (8FB1C8) instead of EUC-JP (F9BB)

Question

mercury233 opened this issue 3 years ago · 3 comments

const iconvLite = require("iconv-lite");

// U+4FFF is the character in question (俿).
const theChar = String.fromCharCode(0x4FFF);
// Encode it to EUC-JP; this yields <8F B1 C8>.
const theEncodeResult = iconvLite.encode(theChar, 'EUC-JP');
// Decode both candidate byte sequences.
const theDecodeResult1 = iconvLite.decode(Buffer.from([0x8F, 0xB1, 0xC8]), 'EUC-JP');
const theDecodeResult2 = iconvLite.decode(Buffer.from([0xF9, 0xBB]), 'EUC-JP');

console.log(theChar);
console.log(theEncodeResult);
console.log(theDecodeResult1);
console.log('------');
console.log(theDecodeResult2);
// Compare the two decoded strings.
console.log(theDecodeResult1 === theDecodeResult2);

https://runkit.com/mercury233/6177adadef03d40008209995

As you can see, both 8FB1C8 and F9BB can be decoded, but encoding the character produces 8FB1C8 rather than F9BB, which I believe is its correct EUC-JP code.
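
To make the two byte sequences easier to compare, here is a minimal sketch that classifies an EUC-JP sequence by its lead byte. It does not use iconv-lite, and the helper name classifyEucJp is hypothetical. It assumes the usual EUC-JP layout (0x8E selects half-width katakana, 0x8F selects the three-byte JIS X 0212 plane, and a plain pair of bytes in 0xA1..0xFE is the JIS X 0208 plane), and it only inspects structure; it does not check whether a given position is actually assigned.

```javascript
// Hypothetical helper: classify an EUC-JP byte sequence by structure only.
// It does not check whether the row/cell position is an assigned character.
function classifyEucJp(bytes) {
  // Bytes 0xA1..0xFE are the range used for row and cell bytes.
  const inGR = (b) => b >= 0xA1 && b <= 0xFE;
  const [b0, b1, b2] = bytes;

  if (bytes.length === 1 && b0 <= 0x7F) {
    return { codeSet: 0, charset: 'ASCII' };
  }
  // 0x8E (SS2) prefixes a single half-width katakana byte.
  if (bytes.length === 2 && b0 === 0x8E && b1 >= 0xA1 && b1 <= 0xDF) {
    return { codeSet: 2, charset: 'JIS X 0201 katakana' };
  }
  // 0x8F (SS3) prefixes a two-byte position in the JIS X 0212 plane.
  if (bytes.length === 3 && b0 === 0x8F && inGR(b1) && inGR(b2)) {
    return { codeSet: 3, charset: 'JIS X 0212', row: b1 - 0xA0, cell: b2 - 0xA0 };
  }
  // A plain two-byte sequence is a position in the JIS X 0208 plane.
  if (bytes.length === 2 && inGR(b0) && inGR(b1)) {
    return { codeSet: 1, charset: 'JIS X 0208', row: b0 - 0xA0, cell: b1 - 0xA0 };
  }
  return null; // not a structurally valid single EUC-JP character
}

console.log(classifyEucJp([0x8F, 0xB1, 0xC8]));
console.log(classifyEucJp([0xF9, 0xBB]));
```

Under these assumptions, 8F B1 C8 is a three-byte sequence in the JIS X 0212 plane (row 17, cell 40), while F9 BB is a two-byte sequence at row 89, cell 27 of the JIS X 0208 plane. JIS X 0208 is generally described as assigning characters only up to row 84, so row 89 may be a vendor extension area rather than standard EUC-JP; that should be confirmed against a mapping table or standard before drawing conclusions.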

Answer 1 · 2021-10-26T15:15:51.000Z

Thanks for the RunKit link! I see "俿" is encoded as <8F, B1, C8> (theEncodeResult). What do you mean by "it can't be encoded correctly"? Is this encoding incorrect?

Answer 2 · 2021-10-27T00:34:07.000Z

I know very little about character encoding, but I found that the EUC-JP code of "俿" may be F9BB, and iconv-lite can decode it.

Answer 3 · 2021-10-29T22:16:42.000Z

Well, honestly, I don't know much about EUC-JP either :) Current behavior seems reasonable, so I'm not sure what to do here. Let me know if you learn anything more specific (ideally with a link to some kind of standard), I can then reopen the issue. Thanks!