Consider adding TextEncoder.containsLoneSurrogates() static
annevk opened this issue · 41 comments
This came up in the IDL for Wasm discussion at WebAssembly/interface-types#21 (comment) (the bits before and after are also relevant context), and @Pauan suggested it might be faster than rustwasm/wasm-bindgen#1348 (comment).
I was thinking of a different suggestion instead: adding a fatal option to TextEncoder.
Currently it feels inconsistent that TextDecoder has a mode to strictly check UTF-8 instead of producing replacement characters, but TextEncoder doesn't and produces them unconditionally, breaking the safe roundtrip.
It's a little different though. The contract of an encoding is to provide a mapping between a byte sequence and a scalar value sequence. In particular this excludes a code point sequence.
From that perspective it makes more sense to defer that concern to a higher layer. If we don't want to do that anymore, we'd also have to support streaming I think, in case your input ends with a lone surrogate, but a subsequent call might provide the other lone surrogate that together form a scalar value.
If we don't want to do that anymore, we'd also have to support streaming I think, in case your input ends with a lone surrogate, but a subsequent call might provide the other lone surrogate that together form a scalar value.
We already don't support streaming [well], because currently this will result in two replacement characters. It seems better to throw explicitly on such cases, and, as you said, defer handling of streaming to the upper layer.
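To make the two-replacement-characters point concrete, here is a small sketch using only the standard TextEncoder: encoding the two halves of a surrogate pair in separate calls yields two U+FFFD replacement characters rather than the bytes the pair would produce together.

```javascript
// Each half of the surrogate pair for U+1F600 is a lone surrogate at encode
// time, so each call emits U+FFFD (bytes EF BF BD) instead of the two calls
// combining into the 4-byte UTF-8 sequence F0 9F 98 80.
const encoder = new TextEncoder();
const fromHalves = [...encoder.encode("\uD83D"), ...encoder.encode("\uDE00")];
const fromPair = [...encoder.encode("\uD83D\uDE00")];
// fromHalves: [0xEF, 0xBF, 0xBD, 0xEF, 0xBF, 0xBD]  (two U+FFFDs)
// fromPair:   [0xF0, 0x9F, 0x98, 0x80]              (U+1F600)
```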
I'm trying to say that if we want to support code point sequence as an input, and add fatal handling for lone surrogates, we also need to handle streaming them for consistency. Or we defer both problems to a higher level as the current design tries to.
I'm trying to say that if we want to support code point sequence as an input
Ah, I think now I understand what you're trying to say: that right now TextEncoder is not even intended to support UTF-16 code points at all, and that conversion to replacement chars is actually a side effect and not intended behaviour of the encoder itself?
Correct, it happens at the IDL layer due to USVString.
I see. I think I'd still be in favour of handling them in TextEncoder too, given how often it's going to be used with JS strings, even if it means adding support for streaming as well.
Otherwise with something like containsLoneSurrogates one would have to go through the string twice, which might have undesirable performance effects on large strings.
Firefox has SIMD-accelerated code for checking if a UTF-16 string is valid UTF-16. I guess it makes sense to expose it to the Web. It's less clear that TextEncoder is the right entry point as opposed to JS strings themselves. I can see how it would be more expedient to standardize a method at the WHATWG than at TC39, but it seems to me it ideally should be a JS string method.
Firefox currently doesn't have a mechanism to signal errors for unpaired surrogates in UTF-16 to UTF-8 conversion, because previously there has not been a use case for such detection. So implementing that approach would involve writing some new (non-glue) code.
I'd like to see if the event issue can be fixed quickly in browsers.
I'd like to see if the event issue can be fixed quickly in browsers.
Also, for the event issue, both checking the whole string for lone surrogates and making UTF-16 to UTF-8 conversion report unpaired surrogates are overkills: It's enough to check if the last code unit of the JS string is an unpaired surrogate.
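One way that last-code-unit check might look, assuming the relevant case for the event bug is a trailing high surrogate (hypothetical helper name, not an existing API):

```javascript
// A high surrogate (0xD800-0xDBFF) in the final position has no following
// code unit to pair with, so it is necessarily unpaired.
function endsWithLoneHighSurrogate(s) {
  if (s.length === 0) return false;
  const last = s.charCodeAt(s.length - 1);
  return last >= 0xD800 && last <= 0xDBFF;
}
```

A complete pair such as "\uD83D\uDE00" ends in a low surrogate and returns false; "a\uD83D" returns true. This is O(1) versus the O(n) whole-string scan.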
It's less clear that TextEncoder is the right entry point as opposed to JS strings themselves.
It would be symmetrical to TextDecoder though. Even if JS gains own support for string encoding/decoding in the future, it seems useful to have this level of consistency between existing APIs in WHATWG Encoding.
because previously there has not been a use case for such detection
Sure, but I think that this use case (asking for only valid UTF-8 output) is common enough to warrant an implementation if/when an API is agreed on. Firefox can use the existing SIMD-based check implementation internally, but I don't think that the actual shape of public-facing APIs should be driven by existing internal implementations...
Sure, but I think that this use case (asking for only valid UTF-8 output) is common enough
s/valid UTF-8/lossless UTF-8/. I don't think it has been demonstrated that asking for losslessness is going to be a common use case. So far, experience with the Servo style engine in particular suggests that the Web, despite Hyrum's Law generally ruling everything, is remarkably free of relying on unpaired surrogates.
I don't think that the actual shape of public-facing APIs should be driven by existing internal implementations...
It shouldn't. I'm just mentioning the implementation distance for particular approaches.
Also, regarding whether this is going to be common, I think it's a notable data point that Firefox doesn't need error-signaling UTF-16 to UTF-8 conversion internally. (It does have error-signaling potentially-invalid UTF-16 to potentially-invalid UTF-8 comparison operation, but maybe it could get away with not having even that one.)
I don't think it has been demonstrated that asking for losslessness is going to be a common use case.
I'm somewhat worried that the reason is that exactly that alternative is not exposed in the API and developers are not aware that it's currently lossy by default. (I personally wasn't aware of this until running into this issue.)
@RReverser I was thinking of a different suggestion instead: adding fatal option to TextEncoder.
I think that would be a good thing to have, but that's orthogonal to this, because a fatal option will not help us.
We want to completely ignore strings which contain unpaired surrogates, so we definitely don't want runtime exceptions!
Otherwise with something like containsLoneSurrogates one would have to go through the string twice, which might have undesirable performance effects on large strings.
Indeed, I agree that is very unfortunate.
But for our use case we still wouldn't want fatal; instead we would want something else, such as an API which returns the encoded Uint8Array (if the string is valid) or null (if the string contains unpaired surrogates). That would avoid the double iteration.
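A userland approximation of such an API (hypothetical tryEncode name; note this still scans the string twice, which is exactly the cost a built-in could avoid):

```javascript
// Returns the UTF-8 bytes, or null if the string contains a lone surrogate.
// In a /u-mode regex, \p{Surrogate} matches only unpaired surrogates, since a
// properly paired surrogate is read as a single supplementary code point.
function tryEncode(str) {
  if (/\p{Surrogate}/u.test(str)) return null;
  return new TextEncoder().encode(str);
}
```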
@hsivonen So far, experience with the Servo style engine in particular suggests that the Web, despite Hyrum's Law generally ruling everything, is remarkably free of relying on unpaired surrogates.
Web code may not rely upon unpaired surrogates per se, but they definitely rely on the ability for invalid JS strings to roundtrip correctly (e.g. the double input event bug).
Unfortunately when encoding to UTF-8 you lose the ability to roundtrip.
But for our use case we still wouldn't want fatal, instead we would want something else, such as an API which returns the encoded Uint8Array (if the string is valid) or null (if the string contains unpaired surrogates). That would avoid the double iteration.
@Pauan Hmm, that doesn't really sound orthogonal to my fatal suggestion. What you're describing sounds more like simple error handling of the result:
```javascript
let result;
try {
  result = strictEncoder.encode(someString);
} catch {
  result = null;
}
// here `result` is what you're describing, still without a separate API
// or going through the string twice
```
@RReverser Catching exceptions is about ~135 times slower than returning null, so we'd really like to avoid that.
And you might have to put in extra checking to make sure that you're catching the right exception (and not some unrelated exception).
It's quite common for DOM APIs to return null on failure.
Catching exceptions is about ~135 times slower than returning null, so we'd really like to avoid that.
wasm-bindgen already does this try-catch in lots of places, including inside the existing passStringToWasm, and tbh at that order of magnitude, 135x is not that big.
It's quite common for DOM APIs to return null on failure.
Which APIs are you referring to? Usually they throw DOMError. Also that sounds rather inconsistent with the TextDecoder API.
wasm-bindgen already does this try-catch in lots of places
Yes, but we plan to move away from that (especially in areas where we know errors cannot occur).
Our goal is to be as fast as possible (zero-cost ideally), at least as fast as JS, and in the future faster-than-JS.
Which APIs are you refering to?
getElementById can return null
querySelector can return null
item can return null
namedItem can return null
lookupPrefix can return null
getAttribute/getAttributeNS can return null
closest can return null
I admit they're not quite the same as TextEncoder, but there is some precedent for returning null.
I'm somewhat worried that the reason is that exactly that alternative is not exposed in the API and developers are not aware that it's currently lossy by default.
I'm not saying that it's unimportant because the Web Platform doesn't have it already. I'm saying that Firefox is a large codebase that deals with UTF-16 and UTF-8 and gets away with not having it internally. It's not conclusive, but it's a data point.
(I personally wasn't aware of this until running into this issue.)
The input event issue is a problem, but I think one problem shouldn't be extrapolated into an assumption that it's going to be a common case.
Web code may not rely upon unpaired surrogates per se, but they definitely rely on the ability for invalid JS strings to roundtrip correctly (e.g. the double input event bug).
Do you have data point other than the double input event bug? Indeed, I think it's better to analyze it as a bug in Chrome's and Firefox's Windows input integration than to analyze it as a fundamental unpaired surrogate dependency of the Web Platform.
Notably, the new (since Firefox 57) CSS OM in Firefox does not round-trip unpaired surrogates, and that hasn't been a problem.
Our goal is to be as fast as possible (zero-cost ideally), at least as fast as JS, and in the future faster-than-JS.
@Pauan Sure, but the error case is rare and not something I'd optimise for, while the success case should be fast with either approach and dominated by the string encoding performance and not error handling.
In this case having idiomatic JS APIs feels more important than relatively small speed improvements, especially since in the future WASM might switch to some host bindings encoding approach, while the API is here to stay for JS developers too.
I admit they're not quite the same as TextEncoder, but there is some precedent for returning null
Yeah, but all of these are not really failures (like Result), but more like Options where "doesn't exist" is a valid and common state and not an error.
The input event issue is a problem, but I think one problem shouldn't be extrapolated into an assumption that it's going to be a common case.
@hsivonen I actually wasn't referring to the input bug - I don't often work with UIs, so it's less interesting for me - but more to the encoding of arbitrary JS strings to UTF-8 in general, both in the browser and Node.js (which both have TextEncoder).
@RReverser in what scenarios did you encounter issues? It would help to have specific examples.
@hsivonen The input event issue is a problem, but I think one problem shouldn't be extrapolated into an assumption that it's going to be a common case.
Indeed, I think it's better to analyze it as a bug in Chrome's and Firefox's Windows input integration than to analyze it as a fundamental unpaired surrogate dependency of the Web Platform.
That's a fair perspective. I also hadn't ever encountered unpaired surrogates until this issue.
I think we still need to have support for unpaired surrogates in Rust (for completeness and thoroughness).
However, if the input event bug is fixed, then our need for TextEncoder.containsLoneSurrogates diminishes dramatically, since we can just use the JsString::iter method to preserve unpaired surrogates (and JsString::is_valid_utf16 to check for unpaired surrogates).
This is pretty slow, but since unpaired surrogates are almost nonexistent, I think that's acceptable.
@RReverser while the success case should be fast with either approach and dominated by the string encoding performance and not error handling
For many years try/catch had a huge performance cost even in the success case (in fact, it completely turned off all JIT optimizations for any functions that used it!)
However, I just now tested it, and was pleasantly surprised that you are right: try/catch only has a ~25% performance cost in the success case. I think that is acceptable for us.
In any case, since it looks like the input bug will be fixed directly in browsers, we no longer need TextEncoder.containsLoneSurrogates, so I guess my point is moot.
For many years try/catch had a huge performance cost even in the success case
Oh yeah I know, but it's been fixed in V8 for ~1.5 years now :)
in what scenarios did you encounter issues? It would help to have specific examples.
@annevk Sorry I missed this!
I can't point to a specific issue because, as I said, until recently I hadn't even realised I have one :) But these days I'm working on a mixed JS + WASM parser where the JS API can send strings to WASM and get ASTs with parsed strings back, and the opposite direction for the codegen.
This is just one particular use case, but both in this case and in any other that involves binary formats, native code or WASM, it feels better to have explicit detection and reporting of any errors instead of silently changing them to something else.
As mentioned above, this feels very similar to TextDecoder + fatal, both in terms of use cases and desired behaviour.
With lookbehind support in JavaScript regular expressions, this functionality can trivially be implemented in userland:
```javascript
function isWellFormed(string) {
  const reLoneSurrogate = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;
  return !reLoneSurrogate.test(string);
}
```
With the u flag it's even simpler:

```javascript
function isWellFormed(string) {
  const reLoneSurrogate = /\p{Surrogate}/u;
  return !reLoneSurrogate.test(string);
}
```
That might explain why it hadn't been considered for standardization before. Still, if this comes up often enough in standards discussions, that might be sufficient motivation to add something like a static String.isWellFormed method to the language.
With lookbehind support in JavaScript regular expressions, this functionality can trivially be implemented in userland
Sure, there's lots of ways to implement it (such as this).
The reason for this suggestion was for performance, not functionality (because the browser can implement it faster).
The reason we cared about performance is because this function would be called a lot (every time a string is sent from JS to Rust).
However, we managed to narrow our use case down to only checking <input>, and after the input bug is fixed we won't need to check anything, so we no longer care much about the performance.
cc @gibson042
However, we managed to narrow our use case down to only checking <input>, and after the input bug is fixed we won't need to check anything, so we no longer care much about the performance.
Why? There's still plenty of places strings with lone surrogates can come from (as mentioned above).
There's still plenty of places strings with lone surrogates can come from (as mentioned above).
Are there? We've only encountered them with <input>. Could you give some concrete examples and an explanation of how unpaired surrogates can occur? (Your examples above seemed pretty vague to me.)
To be clear, unpaired surrogates are invalid UTF-16, and so anything that generates them is (in my opinion) a bug.
If you want to deal with binary data, you should be using something else like Uint8Array, not invalid JS strings.
To be clear, unpaired surrogates are invalid UTF-16, and so anything that generates them is (in my opinion) a bug.
JavaScript doesn't claim UTF-16 compatibility though, so it's not really a bug, but rather part of the language, and so such strings should be taken into account in any APIs that interact with JS, IMO (especially in encoders/decoders).
As mentioned above, I'm usually working with parsers in JavaScript. However, even without going into parsers that parse JavaScript itself, the simplest example can be created even with the built-in JSON.parse API:
```
> JSON.parse(String.raw`"\uD800"`)
"\ud800"
```
Here the input is perfectly valid UTF-8 / UTF-16, but the output is not, and has to be checked when passing the string to WASM or encoding it to disk, otherwise you're risking silent data loss that is not trivial to debug. (It's not hard to imagine a user calling JSON::parse from Rust with arbitrary strings received from elsewhere.)
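To make the loss concrete, here is the same JSON.parse output pushed through a standard encode/decode round trip:

```javascript
// The lone surrogate survives JSON.parse but is silently replaced with
// U+FFFD on encode, so the round trip yields a different string.
const s = JSON.parse(String.raw`"\uD800"`);           // "\uD800"
const bytes = new TextEncoder().encode(s);            // EF BF BD (U+FFFD)
const roundTripped = new TextDecoder().decode(bytes); // "\uFFFD", !== s
```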
Hence my suggestion to enhance TextEncoder to check for lone surrogates automatically and throw the error before even crossing the boundary or losing the data.
I think I'm convinced that adding this to String in some manner is the way to go if it's needed. Encodings require UTF strings as input and only output UTF strings. And TextEncoder isn't the only API that has such a requirement and acts in this "lossy" way. Every API that uses USVString does this, so we'd need a generic solution, which makes this an input problem, not a specific API problem.
JavaScript doesn't claim UTF-16 compatibility though, so it's not really a bug, but rather part of the language
I am very aware of that. That doesn't change the fact that breaking UTF-16 is a really bad idea, which is why I said that it is (in my opinion) a bug.
To be clear, I'm saying it's a bug in the application. Whether it's a bug in the language or not is a different question.
And as @hsivonen has said, based on Firefox's experience the web generally doesn't generate or interact with unpaired surrogates (since pretty much every JS API never produces unpaired surrogates).
So even though technically JS isn't UTF-16, in practice it is, because nobody actually generates unpaired surrogates.
Here the input is perfectly valid UTF-8 / UTF-16
I think that's debatable.
Sure, from the perspective of the consumer, before they run JSON.parse it appears to be valid UTF-16.
But from the perspective of the producer, they had a string which was invalid UTF-16, and then they called JSON.stringify (or similar) on it, and sent it to the consumer.
So I would say that that is a bug in the producer, since they should have never created an invalid string in the first place.
Basically, except in contrived examples, somebody messed up and generated an invalid string. And so it's their responsibility to fix that.
So I'm asking for non-contrived examples of where unpaired surrogates were generated.
Hence my suggestion to enhance TextEncoder to check for lone surrogates automatically and throw the error before even crossing the boundary or losing the data.
Like I said earlier, I think that's a good idea, but it's orthogonal to TextEncoder.containsLoneSurrogate, so I think you should advocate for that in a new issue.
So I would say that that is a bug in the producer, since they should have never created an invalid string in the first place.
Maybe, but pointing elsewhere doesn't exactly make debugging any easier (rather proves my point that, in the absence of a checked API, the debugging of such cases and finding out the root cause can quickly become a hell).
I think I'm convinced that adding this to String in some manner is the way to go if it's needed.
I don't mind moving this to String land.
It's just that TextEncoder seemed like the best place given that it's currently the central point for dealing with string encoding and, as mentioned above, already has a very similar API and mechanism in the opposite direction.
At the same time, JS on its own doesn't care about interop with external APIs or binary formats and traditionally works with lone surrogates in all built-in APIs.
otherwise you're risking silent data loss that is not trivial to debug
I think the notion of "data loss" is overblown. Unpaired surrogates don't have any semantics, so in that sense replacing an unpaired surrogate doesn't lose any meaning from the string.
However, it is imaginable that it could be harmful for the equality relation of (erroneous) strings to differ on different sides of a boundary, so that two strings compare unequal in JS but equal in Wasm. See a similar-themed but different Chrome bug where the bug was caused by two UTF-8 decoders generating a different number of U+FFFDs for erroneous byte sequences.
Round-tripping equality is the main issue for being able to round-trip file paths on Windows in Rust. The Web Platform is somewhat different in the sense that there aren't similar places where the platform gives you unpaired surrogates in a non-bug way (either on its own or giving you unpaired surrogates persisted by someone else) and expects their equality to be preserved. Specification-wise the Web Platform tries hard not to hand you unpaired surrogates. However, if you yourself create unpaired surrogates, the Web Platform most of the time preserves them for the duration of the execution of a JS program.
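A small illustration of that equality hazard, using only standard APIs: two JS strings that differ only in which lone surrogate they contain become byte-identical after encoding, so they compare unequal in JS but would compare equal on the other side of the boundary.

```javascript
// Both lone surrogates are replaced with U+FFFD, so the distinct JS strings
// map to the same UTF-8 bytes.
const a = "x\uD800y";
const b = "x\uDFFFy";
const ea = Array.from(new TextEncoder().encode(a));
const eb = Array.from(new TextEncoder().encode(b));
// a !== b, but ea and eb are both [0x78, 0xEF, 0xBF, 0xBD, 0x79]
```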
@RReverser what's the JS assertion based on? E.g., https://tc39.github.io/ecma262/#sec-encode throws on lone surrogates afaict.
For the original use case here, I think it would still be preferable to get the Gecko and Chromium keyboard event bugs fixed instead of adding facilities for optimizing workarounds. It would be good to do so before EdgeHTML fades away, as it now keeps the Web from assuming the Gecko/Chromium bug.
String.isWellFormed() makes sense if there are enough use cases where the action taken in the ill-formed case is something other than making the string well-formed. If the action is making the string well-formed, it would make more sense to have a method that does that: String.makeWellFormed() that returns this if well-formed and a new String with lone surrogates replaced with the REPLACEMENT CHARACTER otherwise.
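A userland sketch of that suggested behavior (hypothetical function; a built-in could skip the regex scan and return the original string when it is already well-formed):

```javascript
// Replaces every lone surrogate with U+FFFD REPLACEMENT CHARACTER.
// In /u mode, \p{Surrogate} matches only unpaired surrogates.
function makeWellFormed(s) {
  return s.replace(/\p{Surrogate}/gu, "\uFFFD");
}
```

For what it's worth, ECMAScript has since standardized String.prototype.isWellFormed() and String.prototype.toWellFormed() along these lines.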
if there are enough use cases where the action taken in the ill-formed case is something other than making the string well-formed
I think more often than not, it would be used to throw a validation error.
Chatting with some TC39 folks it seems there would be support for well-formed string APIs if someone did the work. It's not clear to me we need to track that here so I'm inclined to close this, but will give it some time in case people feel otherwise.
Draft ECMAScript proposal by @guybedford: https://github.com/guybedford/proposal-is-usv-string