nodejs/node

TextDecoder can't use `latin1` without full-icu

devsnek opened this issue · 8 comments

On at least 12.4.0 and master, `new TextDecoder('latin1')` fails without full-icu.

We definitely do not need full-icu for latin1, and Buffer can use latin1 without it, so it seems something is not wired up correctly here.
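Minimal repro sketch on a build without full-icu; the error code and text below are recalled from memory and may differ slightly between versions:

```js
// On a Node.js build without full-icu:
console.log(Buffer.from([0xe9]).toString('latin1')); // 'é' -- Buffer is fine

try {
  new TextDecoder('latin1');
} catch (err) {
  // Throws something like:
  //   RangeError [ERR_ENCODING_NOT_SUPPORTED]: The "latin1" encoding is not supported
  console.error(err.message);
}
```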

top of head thoughts:

  • iso-8859-1 does not need data to run; it's algorithmic. However,
  • the ICU alias table may be elided without full ICU. (Buffer does not use ICU for latin1 conversion, as far as I know.)

latin1 is the same as windows-1252 under Encoding Standard’s rules, which does need data to run.
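On a full-icu build (or in a browser) the label mapping is visible directly:

```js
// The Encoding Standard treats 'latin1' as just another label for windows-1252.
const dec = new TextDecoder('latin1');
console.log(dec.encoding); // 'windows-1252'
```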

@TimothyGu OK. Yeah, windows-1252 does need a data file. We could change that in ICU, I suppose. latin1 per IANA is iso-8859-1, which is kind of a subset (if you ignore the C1 controls). ICU expands windows-1252 to ibm-5348_P100-1997, which is “Windows Latin1 (w/ euro update)”.

ibm-5348_P100-1997.cnv is 3k and could be included in the 'small icu' package.

How would that work for non-intl builds? TextDecoder falls back to StringDecoder in such builds.

StringDecoder knows about latin1 but not windows-1252, and it farms out the actual decoding to V8 via String::ExternalOneByteStringResource, so it's not trivial to teach it the difference.

To wit: `new StringDecoder('latin1').end(Buffer.from([128]))` returns `'\u0080'`. It should return `'\u20AC'` (the euro sign) with windows-1252.
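Side by side (the TextDecoder half assumes a full-icu build):

```js
const { StringDecoder } = require('string_decoder');

// StringDecoder treats latin1 as ISO-8859-1: every byte maps 1:1 to U+0000..U+00FF.
console.log(new StringDecoder('latin1').end(Buffer.from([0x80]))); // '\u0080'

// The Encoding Standard's latin1 is windows-1252, where byte 0x80 is the euro sign.
console.log(new TextDecoder('latin1').decode(Uint8Array.of(0x80))); // '€' (U+20AC)
```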

string_decoder.cc could hack around that by internally decoding to UTF-8 and then calling String::NewFromUtf8() instead, but that could be problematic for large strings.

Perhaps it's easiest to simply sidestep the ICU/StringDecoder divide and have hard-coded latin1 and windows-1252 byte-to-codepoint tables that are shared between intl and non-intl builds?

For TextDecoder that's pretty trivial to implement. For TextEncoder maybe less so because of partial multi-byte sequences. We could perhaps reuse StringDecoder's logic for dealing with them.
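As an illustration only (not the actual patch), the decode side of such a table is tiny: ISO-8859-1 is just the identity mapping byte → U+00xx, and windows-1252 only remaps the 0x80–0x9F range.

```js
// Standard windows-1252 mapping for bytes 0x80..0x9F; all other bytes map to
// the code point with the same value (same as ISO-8859-1).
const WINDOWS_1252_80_9F = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

function decodeWindows1252(bytes) {
  let out = '';
  for (const b of bytes) {
    out += String.fromCharCode(
      b >= 0x80 && b <= 0x9f ? WINDOWS_1252_80_9F[b - 0x80] : b
    );
  }
  return out;
}

console.log(decodeWindows1252(Uint8Array.of(0x80, 0x41))); // '€A'
```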

(I'd also be open to moving it into libuv because converting between common encodings comes up in many projects, not just Node.js.)

@bnoordhuis which is correct here?

latin1 is the same as windows-1252
… 
To wit: `new StringDecoder('latin1').end(Buffer.from([128]))` returns `'\u0080'`. It should return `'\u20AC'` (the euro sign) with windows-1252.

Well, the 'old' windows-1252 didn't have the euro sign. I would guess the spec includes the Euro update?

Perhaps it's easiest to simply sidestep the ICU/StringDecoder divide and have hard-coded latin1 and windows-1252 byte-to-codepoint tables that are shared between intl and non-intl builds?

Makes some sense, for specific tables… are there any besides latin1 (sic) that would be in this category?

V8 may already have such hard-coded tables.

@mathiasbynens has a pure-JS implementation of windows-1252.
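A hedged sketch of using that package (the `windows-1252` module on npm); the exact API is recalled from memory, so check its README:

```js
// Assumes the v1.x API of https://github.com/mathiasbynens/windows-1252,
// where decode() takes a "byte string" (one char per byte) and encode() goes
// the other way -- recalled from memory, verify against the package docs.
const windows1252 = require('windows-1252');

console.log(windows1252.decode('\x80')); // '€'
console.log(windows1252.encode('€'));    // '\x80'
```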

Don't believe there's anything further actionable here. Closing. Can reopen if necessary.