Consider adding TextEncoder.containsLoneSurrogates() static
annevk opened this issue · 41 comments
This came up in the IDL for Wasm discussion at WebAssembly/interface-types#21 (comment) (the bits before and after are also relevant context), and @Pauan suggested it might be faster than rustwasm/wasm-bindgen#1348 (comment).
I was thinking of a different suggestion instead: adding a fatal option to TextEncoder.
Currently it feels inconsistent that TextDecoder has a mode to strictly check UTF-8 instead of producing replacement characters, but TextEncoder doesn't and produces them unconditionally, breaking the safe roundtrip.
It's a little different though. The contract of an encoding is to provide a mapping between a byte sequence and a scalar value sequence. In particular this excludes a code point sequence.
From that perspective it makes more sense to defer that concern to a higher layer. If we don't want to do that anymore, we'd also have to support streaming I think, in case your input ends with a lone surrogate, but a subsequent call might provide the other lone surrogate that together form a scalar value.
If we don't want to do that anymore, we'd also have to support streaming I think, in case your input ends with a lone surrogate, but a subsequent call might provide the other lone surrogate that together form a scalar value.
We already don't support streaming [well], because currently this will result in two replacement characters. It seems better to throw explicitly on such cases, and, as you said, defer handling of streaming to the upper layer.
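To make the two-replacement-characters point concrete, here is a small sketch using only the standard TextEncoder: encoding the two halves of a surrogate pair in separate calls yields two U+FFFD replacement characters rather than the bytes the pair would produce together.

```javascript
// Each half of the surrogate pair for U+1F600 is a lone surrogate at encode
// time, so each call emits U+FFFD (bytes EF BF BD) instead of the two calls
// combining into the 4-byte UTF-8 sequence F0 9F 98 80.
const encoder = new TextEncoder();
const fromHalves = [...encoder.encode("\uD83D"), ...encoder.encode("\uDE00")];
const fromPair = [...encoder.encode("\uD83D\uDE00")];
// fromHalves: [0xEF, 0xBF, 0xBD, 0xEF, 0xBF, 0xBD]  (two U+FFFDs)
// fromPair:   [0xF0, 0x9F, 0x98, 0x80]              (U+1F600)
```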
I'm trying to say that if we want to support code point sequence as an input, and add fatal handling for lone surrogates, we also need to handle streaming them for consistency. Or we defer both problems to a higher level as the current design tries to.
I'm trying to say that if we want to support code point sequence as an input
Ah, I think now I understand what you're trying to say: that right now TextEncoder is not even intended to support UTF-16 code points at all, and that conversion to replacement chars is actually a side effect and not intended behaviour of the encoder itself?
Correct, it happens at the IDL layer due to USVString.
I see. I think I'd still be in favour of handling them in TextEncoder too, given how often it's going to be used with JS strings, even if it means adding support for streaming as well.
Otherwise with something like containsLoneSurrogates one would have to go through the string twice, which might have undesirable performance effects on large strings.
Firefox has SIMD-accelerated code for checking if a UTF-16 string is valid UTF-16. I guess it makes sense to expose it to the Web. It's less clear that TextEncoder is the right entry point as opposed to JS strings themselves. I can see how it would be more expedient to standardize a method at the WHATWG than at TC39, but it seems to me it ideally should be a JS string method.
Firefox currently doesn't have a mechanism to signal errors for unpaired surrogates in UTF-16 to UTF-8 conversion, because previously there has not been a use case for such detection. So implementing that approach would involve writing some new (non-glue) code.
I'd like to see if the event issue can be fixed quickly in browsers.
I'd like to see if the event issue can be fixed quickly in browsers.
Also, for the event issue, both checking the whole string for lone surrogates and making UTF-16 to UTF-8 conversion report unpaired surrogates are overkills: It's enough to check if the last code unit of the JS string is an unpaired surrogate.
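One way that last-code-unit check might look, assuming the relevant case for the event bug is a trailing high surrogate (hypothetical helper name, not an existing API):

```javascript
// A high surrogate (0xD800-0xDBFF) in the final position has no following
// code unit to pair with, so it is necessarily unpaired.
function endsWithLoneHighSurrogate(s) {
  if (s.length === 0) return false;
  const last = s.charCodeAt(s.length - 1);
  return last >= 0xD800 && last <= 0xDBFF;
}
```

A complete pair such as "\uD83D\uDE00" ends in a low surrogate and returns false; "a\uD83D" returns true. This is O(1) versus the O(n) whole-string scan.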
It's less clear that TextEncoder is the right entry point as opposed to JS strings themselves.
It would be symmetrical to TextDecoder though. Even if JS gains own support for string encoding/decoding in the future, it seems useful to have this level of consistency between existing APIs in WHATWG Encoding.
because previously there has not been a use case for such detection
Sure, but I think that this use case (asking for only valid UTF-8 output) is common enough to warrant an implementation if/when an API is agreed on. Firefox can use the existing SIMD-based check implementation internally, but I don't think that the actual shape of public-facing APIs should be driven by existing internal implementations...
Sure, but I think that this use case (asking for only valid UTF-8 output) is common enough
s/valid UTF-8/lossless UTF-8/. I don't think it has been demonstrated that asking for losslessness is going to be a common use case. So far, experience with the Servo style engine in particular suggests that the Web, despite Hyrum's Law generally ruling everything, is remarkably free of relying on unpaired surrogates.
I don't think that the actual shape of public-facing APIs should be driven by existing internal implementations...
It shouldn't. I'm just mentioning the implementation distance for particular approaches.
Also, regarding whether this is going to be common, I think it's a notable data point that Firefox doesn't need error-signaling UTF-16 to UTF-8 conversion internally. (It does have error-signaling potentially-invalid UTF-16 to potentially-invalid UTF-8 comparison operation, but maybe it could get away with not having even that one.)
I don't think it has been demonstrated that asking for losslessness is going to be a common use case.
I'm somewhat worried that the reason is that exactly that alternative is not exposed in the API and developers are not aware that it's currently lossy by default. (I personally wasn't aware of this until running into this issue.)
@RReverser I was thinking of a different suggestion instead: adding fatal option to TextEncoder.
I think that would be a good thing to have, but that's orthogonal to this, because a fatal option will not help us.
We want to completely ignore strings which contain unpaired surrogates, so we definitely don't want runtime exceptions!
Otherwise with something like containsLoneSurrogates one would have to go through the string twice, which might have undesirable performance effects on large strings.
Indeed, I agree that is very unfortunate.
But for our use case we still wouldn't want fatal; instead we would want something else, such as an API which returns the encoded Uint8Array (if the string is valid) or null (if the string contains unpaired surrogates). That would avoid the double iteration.
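A userland approximation of such an API (hypothetical tryEncode name; note this still scans the string twice, which is exactly the cost a built-in could avoid):

```javascript
// Returns the UTF-8 bytes, or null if the string contains a lone surrogate.
// In a /u-mode regex, \p{Surrogate} matches only unpaired surrogates, since a
// properly paired surrogate is read as a single supplementary code point.
function tryEncode(str) {
  if (/\p{Surrogate}/u.test(str)) return null;
  return new TextEncoder().encode(str);
}
```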
@hsivonen So far, experience with the Servo style engine in particular suggests that the Web, despite Hyrum's Law generally ruling everything, is remarkably free of relying on unpaired surrogates.
Web code may not rely upon unpaired surrogates per se, but they definitely rely on the ability for invalid JS strings to roundtrip correctly (e.g. the double input event bug).
Unfortunately when encoding to UTF-8 you lose the ability to roundtrip.
But for our use case we still wouldn't want fatal, instead we would want something else, such as an API which returns the encoded Uint8Array (if the string is valid) or null (if the string contains unpaired surrogates). That would avoid the double iteration.
@Pauan Hmm, that doesn't really sound orthogonal to my fatal suggestion. What you're describing sounds more like simple error handling of the result:
```javascript
let result;
try {
  result = strictEncoder.encode(someString);
} catch {
  result = null;
}
// here `result` is what you're describing, still without a separate API
// or going through the string twice
```
@RReverser Catching exceptions is about ~135 times slower than returning null, so we'd really like to avoid that.
And you might have to put in extra checking to make sure that you're catching the right exception (and not some unrelated exception).
It's quite common for DOM APIs to return null on failure.
Catching exceptions is about ~135 times slower than returning null, so we'd really like to avoid that.
wasm-bindgen already does this try-catch in lots of places, including inside the existing passStringToWasm, and tbh at that order of magnitude, 135x is not that big.
It's quite common for DOM APIs to return null on failure.
Which APIs are you referring to? Usually they throw DOMError. Also that sounds rather inconsistent with the TextDecoder API.
wasm-bindgen already does this try-catch in lots of places
Yes, but we plan to move away from that (especially in areas where we know errors cannot occur).
Our goal is to be as fast as possible (zero-cost ideally), at least as fast as JS, and in the future faster-than-JS.
Which APIs are you refering to?
getElementById can return null
querySelector can return null
item can return null
namedItem can return null
lookupPrefix can return null
getAttribute/getAttributeNS can return null
closest can return null
I admit they're not quite the same as TextEncoder, but there is some precedent for returning null.
I'm somewhat worried that the reason is that exactly that alternative is not exposed in the API and developers are not aware that it's currently lossy by default.
I'm not saying that it's unimportant because the Web Platform doesn't have it already. I'm saying that Firefox is a large codebase that deals with UTF-16 and UTF-8 and gets away with not having it internally. It's not conclusive, but it's a data point.
(I personally wasn't aware of this until running into this issue.)
The input event issue is a problem, but I think one problem shouldn't be extrapolated into an assumption that it's going to be a common case.
Web code may not rely upon unpaired surrogates per se, but they definitely rely on the ability for invalid JS strings to roundtrip correctly (e.g. the double input event bug).
Do you have data point other than the double input event bug? Indeed, I think it's better to analyze it as a bug in Chrome's and Firefox's Windows input integration than to analyze it as a fundamental unpaired surrogate dependency of the Web Platform.
Notably, the new (since Firefox 57) CSS OM in Firefox does not round-trip unpaired surrogates, and that hasn't been a problem.
Our goal is to be as fast as possible (zero-cost ideally), at least as fast as JS, and in the future faster-than-JS.
@Pauan Sure, but the error case is rare and not something I'd optimise for, while the success case should be fast with either approach and dominated by the string encoding performance and not error handling.
In this case having idiomatic JS APIs feels more important than relatively small speed improvements, especially since in the future WASM might switch to some host bindings encoding approach, while the API is here to stay for JS developers too.
I admit they're not quite the same as TextEncoder, but there is some precedent for returning null
Yeah, but all of these are not really failures (like Result), but more like Options where "doesn't exist" is a valid and common state and not an error.
The input event issue is a problem, but I think one problem shouldn't be extrapolated into an assumption that it's going to be a common case.
@hsivonen I actually wasn't referring to the input bug - I don't often work with UIs, so it's less interesting for me - but more to the encoding of arbitrary JS strings to UTF-8 in general, both in the browser and Node.js (which both have TextEncoder).
@RReverser in what scenarios did you encounter issues? It would help to have specific examples.
@hsivonen The input event issue is a problem, but I think one problem shouldn't be extrapolated into an assumption that it's going to be a common case.
Indeed, I think it's better to analyze it as a bug in Chrome's and Firefox's Windows input integration than to analyze it as a fundamental unpaired surrogate dependency of the Web Platform.
That's a fair perspective. I also hadn't ever encountered unpaired surrogates until this issue.
I think we still need to have support for unpaired surrogates in Rust (for completeness and thoroughness).
However, if the input event bug is fixed, then our need for TextEncoder.containsLoneSurrogates diminishes dramatically, since we can just use the JsString::iter method to preserve unpaired surrogates (and JsString::is_valid_utf16 to check for unpaired surrogates).
This is pretty slow, but since unpaired surrogates are almost nonexistent, I think that's acceptable.
@RReverser while the success case should be fast with either approach and dominated by the string encoding performance and not error handling
For many years try/catch had a huge performance cost even in the success case (in fact, it completely turned off all JIT optimizations for any functions that used it!)
However, I just now tested it, and was pleasantly surprised that you are right: try/catch only has a ~25% performance cost in the success case. I think that is acceptable for us.
In any case, since it looks like the input bug will be fixed directly in browsers, we no longer need TextEncoder.containsLoneSurrogates, so I guess my point is moot.
For many years try/catch had a huge performance cost even in the success case
Oh yeah I know, but it's been fixed in V8 for ~1.5 years now :)
in what scenarios did you encounter issues? It would help to have specific examples.
@annevk Sorry I missed this!
I can't point to a specific issue because, as I said, until recently I hadn't even realised I have one :) But these days I'm working on a mixed JS + WASM parser where the JS API can send strings to WASM and get ASTs with parsed strings back, and the opposite direction for the codegen.
This is just one particular use case, but both in this case and in any other that involves binary formats, native code or WASM, it feels better to have explicit detection and reporting of any errors instead of silently changing them to something else.
As mentioned above, this feels very similar to TextDecoder + fatal, both in terms of use cases and desired behaviour.
With lookbehind support in JavaScript regular expressions, this functionality can trivially be implemented in userland:
```javascript
function isWellFormed(string) {
  const reLoneSurrogate = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;
  return !reLoneSurrogate.test(string);
}
```
With the u flag it's even simpler:

```javascript
function isWellFormed(string) {
  const reLoneSurrogate = /\p{Surrogate}/u;
  return !reLoneSurrogate.test(string);
}
```
That might explain why it hadn't been considered for standardization before. Still, if this comes up often enough in standards discussions, that might be sufficient motivation to add something like a static String.isWellFormed method to the language.
With lookbehind support in JavaScript regular expressions, this functionality can trivially be implemented in userland
Sure, there's lots of ways to implement it (such as this).
The reason for this suggestion was for performance, not functionality (because the browser can implement it faster).
The reason we cared about performance is because this function would be called a lot (every time a string is sent from JS to Rust).
However, we managed to narrow our use case down to only checking <input>, and after the input bug is fixed we won't need to check anything, so we no longer care much about the performance.
cc @gibson042
However, we managed to narrow our use case down to only checking <input>, and after the input bug is fixed we won't need to check anything, so we no longer care much about the performance.
Why? There's still plenty of places strings with lone surrogates can come from (as mentioned above).
There's still plenty of places strings with lone surrogates can come from (as mentioned above).
Are there? We've only encountered them with <input>. Could you give some concrete examples and an explanation of how unpaired surrogates can occur? (Your examples above seemed pretty vague to me.)
To be clear, unpaired surrogates are invalid UTF-16, and so anything that generates them is (in my opinion) a bug.
If you want to deal with binary data, you should be using something else like Uint8Array, not invalid JS strings.
To be clear, unpaired surrogates are invalid UTF-16, and so anything that generates them is (in my opinion) a bug.
JavaScript doesn't claim UTF-16 compatibility though, so it's not really a bug, but rather part of the language, and so such strings should be taken into account in any APIs that interact with JS, IMO (especially in encoders/decoders).
As mentioned above, I'm usually working with parsers in JavaScript. However, even without going into parsers that parse JavaScript itself, the simplest example can be created even with the built-in JSON.parse API:
```
> JSON.parse(String.raw`"\uD800"`)
"\ud800"
```
Here the input is perfectly valid UTF-8 / UTF-16, but the output is not, and has to be checked when passing the string to WASM or encoding it to disk, otherwise you're risking silent data loss that is not trivial to debug. (It's not hard to imagine a user calling JSON::parse from Rust with arbitrary strings received from elsewhere.)
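To make the loss concrete, here is the same JSON.parse output pushed through a standard encode/decode round trip:

```javascript
// The lone surrogate survives JSON.parse but is silently replaced with
// U+FFFD on encode, so the round trip yields a different string.
const s = JSON.parse(String.raw`"\uD800"`);           // "\uD800"
const bytes = new TextEncoder().encode(s);            // EF BF BD (U+FFFD)
const roundTripped = new TextDecoder().decode(bytes); // "\uFFFD", !== s
```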
Hence my suggestion to enhance TextEncoder to check for lone surrogates automatically and throw the error before even crossing the boundary or losing the data.
I think I'm convinced that adding this to String in some manner is the way to go if it's needed. Encodings require UTF strings as input and only output UTF strings. And TextEncoder isn't the only API that has such a requirement and acts in this "lossy" way. Every API that uses USVString does this, so we'd need a generic solution, which makes this an input problem, not a specific API problem.
JavaScript doesn't claim UTF-16 compatibility though, so it's not really a bug, but rather part of the language
I am very aware of that. That doesn't change the fact that breaking UTF-16 is a really bad idea, which is why I said that it is (in my opinion) a bug.
To be clear, I'm saying it's a bug in the application. Whether it's a bug in the language or not is a different question.
And as @hsivonen has said, based on Firefox's experience the web generally doesn't generate or interact with unpaired surrogates (since pretty much every JS API never produces unpaired surrogates).
So even though technically JS isn't UTF-16, in practice it is, because nobody actually generates unpaired surrogates.
Here the input is perfectly valid UTF-8 / UTF-16
I think that's debatable.
Sure, from the perspective of the consumer, before they run JSON.parse it appears to be valid UTF-16.
But from the perspective of the producer, they had a string which was invalid UTF-16, and then they called JSON.stringify (or similar) on it, and sent it to the consumer.
So I would say that that is a bug in the producer, since they should have never created an invalid string in the first place.
Basically, except in contrived examples, somebody messed up and generated an invalid string. And so it's their responsibility to fix that.
So I'm asking for non-contrived examples of where unpaired surrogates were generated.
Hence my suggestion to enhance TextEncoder to check for lone surrogates automatically and throw the error before even crossing the boundary or losing the data.
Like I said earlier, I think that's a good idea, but it's orthogonal to TextEncoder.containsLoneSurrogate, so I think you should advocate for that in a new issue.
So I would say that that is a bug in the producer, since they should have never created an invalid string in the first place.
Maybe, but pointing elsewhere doesn't exactly make debugging any easier (rather proves my point that, in the absence of a checked API, the debugging of such cases and finding out the root cause can quickly become a hell).
I think I'm convinced that adding this to String in some manner is the way to go if it's needed.
I don't mind moving this to String land.
It's just that TextEncoder seemed like the best place given that it's currently the central point for dealing with string encoding and, as mentioned above, already has a very similar API and mechanism in the opposite direction.
At the same time, JS on its own doesn't care about interop with external APIs or binary formats and traditionally works with lone surrogates in all built-in APIs.
otherwise you're risking silent data loss that is not trivial to debug
I think the notion of "data loss" is overblown. Unpaired surrogates don't have any semantics, so in that sense replacing an unpaired surrogate doesn't lose any meaning from the string.
However, it is imaginable that it could be harmful for the equality relation of (erroneous) strings to differ on different sides of a boundary, so that two strings compare unequal in JS but equal in Wasm. See a similar-themed but different Chrome bug where the bug was caused by two UTF-8 decoders generating a different number of U+FFFDs for erroneous byte sequences.
Round-tripping equality is the main issue for being able to round-trip file paths on Windows in Rust. The Web Platform is somewhat different in the sense that there aren't similar places where the platform gives you unpaired surrogates in a non-bug way (either on its own or giving you unpaired surrogates persisted by someone else) and expects their equality to be preserved. Specification-wise the Web Platform tries hard not to hand you unpaired surrogates. However, if you yourself create unpaired surrogates, the Web Platform most of the time preserves them for the duration of the execution of a JS program.
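A small illustration of that equality hazard, using only standard APIs: two JS strings that differ only in which lone surrogate they contain become byte-identical after encoding, so they compare unequal in JS but would compare equal on the other side of the boundary.

```javascript
// Both lone surrogates are replaced with U+FFFD, so the distinct JS strings
// map to the same UTF-8 bytes.
const a = "x\uD800y";
const b = "x\uDFFFy";
const ea = Array.from(new TextEncoder().encode(a));
const eb = Array.from(new TextEncoder().encode(b));
// a !== b, but ea and eb are both [0x78, 0xEF, 0xBF, 0xBD, 0x79]
```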
@RReverser what's the JS assertion based on? E.g., https://tc39.github.io/ecma262/#sec-encode throws on lone surrogates afaict.
For the original use case here, I think it would still be preferable to get the Gecko and Chromium keyboard event bugs fixed instead of adding facilities for optimizing workarounds. It would be good to do so before EdgeHTML fades away, as it now keeps the Web from assuming the Gecko/Chromium bug.
String.isWellFormed() makes sense if there are enough use cases where the action taken in the ill-formed case is something other than making the string well-formed. If the action is making the string well-formed, it would make more sense to have a method that does that: String.makeWellFormed() that returns this if well-formed and a new String with lone surrogates replaced with the REPLACEMENT CHARACTER otherwise.
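A userland sketch of that suggested behavior (hypothetical function; a built-in could skip the regex scan and return the original string when it is already well-formed):

```javascript
// Replaces every lone surrogate with U+FFFD REPLACEMENT CHARACTER.
// In /u mode, \p{Surrogate} matches only unpaired surrogates.
function makeWellFormed(s) {
  return s.replace(/\p{Surrogate}/gu, "\uFFFD");
}
```

For what it's worth, ECMAScript has since standardized String.prototype.isWellFormed() and String.prototype.toWellFormed() along these lines.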
if there are enough use cases where the action taken in the ill-formed case is something other than making the string well-formed
I think more often than not, it would be used to throw a validation error.
Chatting with some TC39 folks it seems there would be support for well-formed string APIs if someone did the work. It's not clear to me we need to track that here so I'm inclined to close this, but will give it some time in case people feel otherwise.
Draft ECMAScript proposal by @guybedford: https://github.com/guybedford/proposal-is-usv-string