STRING i32 pairs for UTF-16LE
dcodeIO opened this issue · 169 comments
Regarding
JavaScript hosts might additionally provide:
STRING | Converts the next two arguments from a pair of i32s to a utf8 string. It treats the first as an address in linear memory of the string bytes, and the second as a length.
Any chance that there'll be support for JS-style strings (UTF-16LE) as well? I know this doesn't really fit into the C/C++ world, but languages approaching things the other way around will most likely benefit when not having to convert back and forth on every host binding call.
Since this hasn't received any comments yet, allow me to bump this: I am still curious if UTF-16 strings can be supported. In AssemblyScript's case all strings are UTF-16LE already, so only having the option to re-encode (potentially twice if the bound API wants UTF-16) does seem like it should be taken into account.
If most hosts would require copying unicode16 into utf8 anyway, you may have trouble with (a).
To me it looks like having such an operator can lead to significantly less work where UTF-16 is already present on both sides of the equation, while any case where either side is UTF-8 can easily be handled by reencoding conditionally. Hence, the module would choose the operator that fits its internal string layout ideally, and the host would do whatever is necessary to make it fit into theirs. This leaves us with these cases:
- Both UTF-8: Essentially memcpy
- Both UTF-16: Essentially memcpy
- One UTF-8, the other UTF-16: Reencoding once
while avoiding the very unfortunate case of
- Module UTF-16, host UTF-16: Reencode twice because UTF-8 is all the bindings understand
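To make the cost concrete, here's a rough sketch of what host-side glue could look like in JS terms (illustrative only; `utf16StrBinding` is hypothetical, today only the UTF-8 flavor is proposed):

```ts
// Sketch: a module keeps its strings as UTF-16LE in linear memory.
const memory = new WebAssembly.Memory({ initial: 1 });

// With a UTF-8-only binding, the module must first re-encode UTF-16 -> UTF-8
// into a scratch buffer inside wasm; the host then decodes UTF-8 -> JS string,
// which internally is UTF-16 again. That's the "reencode twice" case.
function utf8StrBinding(ptr: number, len: number): string {
  const bytes = new Uint8Array(memory.buffer, ptr, len);
  return new TextDecoder("utf-8").decode(bytes);
}

// With a (hypothetical) utf16-str binding, the host reads the module's bytes
// directly: a single decode, essentially a copy.
function utf16StrBinding(ptr: number, byteLen: number): string {
  const bytes = new Uint8Array(memory.buffer, ptr, byteLen);
  return new TextDecoder("utf-16le").decode(bytes);
}
```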
But, essentially, it's a bit early to consider committing to any operators at the moment.
I see, yet I thought it might make sense to raise this early so once committing to operators, this case is well thought through :)
while avoiding the very unfortunate case of
- Module UTF-16, host UTF-16: Reencode twice because UTF-8 is all the bindings understand
Yeah that would be unfortunate.
Any chance that there'll be support for JS-style strings (UTF-16LE)
The real question is, can we skip JS entirely? At which point, what does the host API use internally? For example a similarly bad outcome would be
- Source UTF-8 -> JS UTF-16 -> Web API UTF-8
So I think the sanest way to handle that is with a declarative API on the bindings layer. Which is what you said earlier:
Hence, the module would choose the operator that fits its internal string layout ideally, and the host would do whatever is necessary to make it fit into theirs.
So the higher-level point is that we should be able to adequately describe the most common/reasonable ways to encode strings, so that we can minimize the number of encodings in the best case.
But, essentially, it's a bit early to consider committing to any operators at the moment.
Agree + disagree. On the one hand, it's all in the sketch stage at the moment, where we're feeling out the rough edges. So from a managing-expectations point of view, this makes sense to say.
On the other hand, it's kind of incongruous to say "everything's up in the air, so don't raise any design issues." I don't think that was the intent, but that's kind of how it sounded. A more accurate translation of how I heard it is "don't worry about this now, we'll figure it out later." To which I would say, as a general principle, that yes we'll figure it out later, but we should raise it now to figure out if we should worry about it now. Especially because multiple people can think about different bits of the spec asynchronously.
Also something I should mention explicitly:
I find it incredibly likely that we will default to 1 binding expression per ☃️-type per wasm representation (e.g. 1 for linear memory and 1 for gc), which is to say the MVP of ☃️-bindings will have one binding expression per type, because gc will probably not be shipped yet. On that basis, we will probably start with only UTF-8-encoding (I imagine we will drop the utf8-cstr binding too, for similar reasons).
My general mental model here is that we can always add bindings in the future as we find a need for them. And it may be the case that in practice, the re-encoding from UTF-16 isn't enough of a bottleneck to be worth it. Unless it is, at which point we can add that binding, and it will be more obviously useful because we'll have much more real-world data.
Also for AssemblyScript specifically, would it be reasonable to change the internal string representation from UTF-16 to UTF-8 in the presence of ☃️-bindings? It is, after all, "Definitely not a TypeScript to WebAssembly compiler"
And it may be the case that in practice, the re-encoding from UTF-16 isn't enough of a bottleneck to be worth it. Unless it is, at which point we can add that binding, and it will be more obviously useful because we'll have much more real-world data.
At the end of the day we are just building tools here and one can't know everyone's use case. Any use case extensively calling bound functions with string arguments would hit this, and my expectation would be that this will happen anyway (in certain use cases). Like, if we'd wait, this'll surface sooner or later, so it might as well be addressed from the start, instead of having to tell everyone running into this that their use case is currently not well-supported even though we did see it coming. Especially since specification and implementation of new operators can take a long time again.
Also for AssemblyScript specifically, would it be reasonable to change the internal string representation from UTF-16 to UTF-8 in the presence of ☃️-bindings? It is, after all, "Definitely not a TypeScript to WebAssembly compiler"
I'm sorry, the "☃️-bindings" term is new to me. Would you point me in the right direction where I can learn about it? :)
Regarding UTF-8: In fact we have been thinking about this, but it doesn't seem feasible, because we are re-implementing String after the JS API (with other stdlib components relying on it), and going with something other than a UCS-2 representation seems suboptimal, since the API is so deeply rooted in the language that mimicking UCS-2 semantics on top of another encoding would cost too much perf-wise. After all we are trying to stay as close to TS as reasonable to make picking up AssemblyScript a smooth experience. Also would like to note that this isn't exclusively an AssemblyScript thing, as other languages are using UTF-16LE as well, like everything in the .NET/Mono space.
Like, if we'd wait, this'll surface sooner or later, so it might as well be addressed from the start, instead of having to tell everyone running into this that their use case is currently not well-supported even though we did see it coming. Especially since specification and implementation of new operators can take a long time again.
It's ultimately a tradeoff. My thoughts here are that it will be strictly easier to spec and implement a bindings proposal that defines 8 operators, as opposed to one that defines 40. So we could just add UTF-16, but we could also just add C-strings and we could just add Scheme cons-list strings and we could just add Haskell lazy cons thunks, and so on. So for MVP I think we need to be really strict as to what exactly is "minimal", and in this context Minimal means "we can reason about strings at all".
We also need to balance the "viable" portion. Originally I was thinking we should avoid reasoning about strings and allocators at all, due to the complexity they add. Further discussion on this (see: #25) made me realize that not having an answer for allocators would compromise the viability of the proposal entirely. On that basis, not having UTF-16 support from day 1 is unlikely to leave the bindings proposal dead in the water.
By means of analogy, I would rather we ship anyref without waiting for the full gc proposal, because anyref on its own is a very enabling feature. It is in many ways suboptimal, but it is more useful than what we had before. On that basis, I want to be very cautious about adding scope to the bindings MVP, especially when that scope is separable to a v2 that describes an expanded set of binding expressions.
I'm sorry, the "snowman-bindings" term is new to me. Would you point me into the right direction where I can learn about it? :)
Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit
tl;dr does this wasm binding layer we're describing need to reason about WebIDL at its core, or is WebIDL another target with a produce/consume pair? If the latter, and we suspect that is the case, then we're free to design an IDL that better matches what we're trying to do, rather than try to retrofit that on top of WebIDL.
Full notes of the accompanying discussion here: https://github.com/WebAssembly/meetings/blob/master/2019/CG-06.md#webidl-bindings-1-2-hrs
Also would like to note that this isn't exclusively an AssemblyScript thing
Didn't mean to sound like I was saying it was :x, sorry. I was thinking that if AssemblyScript was using UTF-16 for easier FFI with JS, then in the presence of something-bindings it would be possible to decouple that ABI. And also that AssemblyScript would probably have an easier time of making that ABI switch than a more-ossified target like .NET, on account of it being a younger platform.
My thoughts here are that it will be strictly easier to spec and implement a bindings proposal that defines 8 operators, as opposed to one that defines 40
Makes sense, yeah. Though, to me it seems not overly complex to have a (potentially extensible) immediate operand on str (/ alloc-str) that indicates a well-known encoding. I'd consider UTF-8, UTF-16LE and maybe ASCII here (not sure), with length always provided by the caller (even if null-terminated), but I'm certainly not an expert in this regard.
By means of analogy, I would rather we ship anyref without waiting for the full gc proposal, because anyref on its own is a very enabling feature. It is in many ways suboptimal, but it is more useful than what we had before.
I totally agree with the anyref mention, but don't entirely agree with the comparison to encodings. anyref is a useful feature on its own with everything else building upon it, while not addressing encoding challenges when introducing the very feature that has to deal with them leads to half a feature that unnecessarily limits what certain ecosystems with (imo) perfectly legitimate use cases like UTF-16 can do efficiently.
Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit
Thanks! :)
So, looking at this slide, it mentions utf8 exclusively, similar to what we have with WebIDL. Not quite sure how it would solve the underlying issue, that is making a compatible string from raw bytes, if it moves the problem from "directly allocating a string compatible with WebIDL bindings" to "creating a DOMString/anyref compatible with ☃️-bindings" (if I understood this correctly?). For instance, TextEncoder doesn't support UTF-16LE (anymore), but TextDecoder does.
I'd expect that at some point in either implementation "making a compatible string from raw bytes" will be necessary anyway if the primary string implementation is provided by the module, which is likely. Please correct me if I'm missing something here. Ultimately, the issue doesn't have to be solved in the WebIDL spec, but any other spec solving it would be perfectly fine as well - as long as it is solved.
Didn't mean to sound like I was saying it was :x, sorry. I was thinking that if AssemblyScript was using UTF-16 for easier FFI with JS, then in the presence of something-bindings it would be possible to decouple that ABI. And also that AssemblyScript would probably have an easier time of making that ABI switch than a more-ossified target like .NET, on account of it being a younger platform.
All good, your point makes perfect sense. Just wanted to emphasize that, even if AssemblyScript would make this change, this is a broader problem than what it might look like from this issue alone :)
I thought that if WebAssembly implements WebIDL bindings it should follow the WebIDL spec, which supports three types of strings: DOMString, ByteString and USVString. Most of the WebIDL related to Web APIs uses DOMString, which is commonly interpreted as UTF-16 encoded strings [RFC2781]. ByteString is actually ASCII, and finally USVString does not require a concrete encoding format. An additional note about USVString from the WebIDL spec:
Specifications should only use USVString for APIs that perform text processing and need a string of Unicode scalar values to operate on. Most APIs that use strings should instead be using DOMString, which does not make any interpretations of the code units in the string. When in doubt, use DOMString.
@dcodeIO I'd consider UTF-8, UTF-16LE and maybe ASCII here (not sure)
UTF-8 was intentionally designed as a strict super-set of ASCII, therefore UTF-8 can be used to efficiently transfer ASCII text.
UTF-8 was intentionally designed as a strict super-set of ASCII, therefore UTF-8 can be used to efficiently transfer ASCII text.
Yeah, tried to be careful there (with regard to C-strings), but the more I think about it the less I believe that this distinction is necessary, especially since any API being bound will very likely be reasonably modern anyway. So that'd leave us with UTF-8 and UTF-16LE. Anything else you could imagine fitting there in terms of "well-known encodings" (in the context of modern programming languages)?
@Pauan WebIDL (except ByteString) and JavaScript don't use ASCII at all. Strings in JavaScript are represented as UTF-16LE by default, but V8, for example, can represent strings in different ways and encodings internally. For example, during concatenation strings can be represented as a rope structure, which is flattened to a "normal" string before serialization / further conversion or before being passed to a Web API. But that doesn't mean we should use a rope structure as the default string representation, for example. The same goes for UTF-8.
Side note: USVString looks like it can be described in terms of UTF-32 (not sure if that makes sense as I don't know anything using it for its internal representation). But maybe the least common denominator is UTF here?
About ByteString in WebIDL
Specifications should only use ByteString for interfacing with protocols that use bytes and strings interchangeably, such as HTTP. In general, strings should be represented with DOMString values, even if it is expected that values of the string will always be in ASCII or some 8 bit character encoding. Sequences or frozen arrays with octet or byte elements, Uint8Array, or Int8Array should be used for holding 8 bit data rather than ByteString.
@MaxGraey I am aware. The purpose of WebIDL bindings is to allow many different languages to use WebIDL APIs without using JavaScript.
Since each language does things differently, that means there needs to be a way to convert from one type to another type.
That's why there's a UTF-8 -> WebIDL string conversion, to allow for languages like Rust to use WebIDL bindings (since Rust uses UTF-8).
That's why there's a UTF-8 -> WebIDL string conversion, to allow for languages like Rust to use WebIDL bindings (since Rust uses UTF-8).
So every browser which has already implemented WebIDL bindings for JavaScript, and the rest of the languages like C#/Mono, Java, Python and others which are still popular today, should change their internal string representation? I guess all these languages in total are much more popular than Rust, no matter how awesome it is)
I don't mind utf8-str but I think the proposal should care about utf16le-str as well =)
The WebIDL bindings proposal already cares about the pretty special null-terminated strings allowed only in C/C++ (utf8-cstr). So it already cares about backward compatibility for legacy approaches)
So every browser which has already implemented WebIDL bindings for JavaScript, and the rest of the languages like C#/Mono, Java, Python and others which are still popular today, should change their internal string representation?
I'm not sure where you got that idea... you seem to be misunderstanding how all of this works. I suggest you read the recent slides, especially slide 29.
The way that it works is that the browser implements WebIDL strings (using whatever representation it wants, just like how it does right now). And then there are various "binding operators" which convert from other string types to/from the WebIDL strings.
So you can have a binding operator which converts from UTF-8 to WebIDL strings, or a binding operator which converts from UTF-16 to WebIDL strings. The browser doesn't need to change its internal string representation, it just needs to implement a simple conversion function.
I'm also not sure why you're bringing up languages like C#/Mono, Java, or Python... they are also implemented in WebAssembly linear memory, and so they need binding operators. The binding operators are not a "Rust-only" thing, they benefit all languages. That's why it's a UTF-8 conversion, so it can be used by all languages which use UTF-8 strings.
I'm also not sure why you're bringing up languages like C#/Mono, Java, or Python... they are also implemented in WebAssembly linear memory, and so they need binding operators. The binding operators are not a "Rust-only" thing, they benefit all languages. That's why it's a UTF-8 conversion, so it can be used by all languages which use UTF-8 strings.
I believe the point he wanted to make is that all those languages use UTF-16LE internally so all of them would face the potential performance penalty this issue is about.
I believe the point he wanted to make is that all those languages use UTF-16LE internally so all of them would face the potential performance penalty this issue is about.
Okay, but I never spoke about UTF-16 (which I am in favor of).
I only said that languages which use ASCII do not need a special "ASCII binding operator", since they can use UTF-8 instead.
I only said that languages which use ASCII do not need a special "ASCII binding operator", since they can use UTF-8 instead.
Yes, just one note: it's C (and probably C++ as well) and it should use utf8-cstr - the null-terminated version of utf8-str: https://github.com/WebAssembly/webidl-bindings/blob/master/proposals/webidl-bindings/Explainer.md#binding-operators-and-expressions
So, to recap my perspective a little here, maybe one way to avoid re-encoding on every host-binding call, which discriminates against languages following another UTF standard, could be to make the encoding kind an immediate operand of utf-str and alloc-utf-str (dropping the 8), with valid encodings being UTF-8 (& UTF-8-zero-terminated?), UTF-16LE and potentially UTF-32 (USVString <-> USVString fallback?). Based on the pair of (source-encoding, target-encoding), the host would either preserve the representation if both are equal, or convert into either one depending on what it deems appropriate.
Since those encodings are relatively similar, I'd say that the implementation isn't a significant burden, while solving the issue for most modern programming languages for good.
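To illustrate (purely a sketch of the idea, not a spec proposal - the operand and function names are made up), the host side of such a parameterized binding could be little more than a dispatch on the encoding immediate:

```ts
// Hypothetical: `encoding` is the immediate operand carried by the binding.
type StrEncoding = "utf-8" | "utf-16le" | "utf-32le";

function readBoundString(mem: WebAssembly.Memory, ptr: number, byteLen: number,
                         encoding: StrEncoding) {
  const bytes = new Uint8Array(mem.buffer, ptr, byteLen);
  switch (encoding) {
    case "utf-8":
    case "utf-16le":
      // Both are handled natively by TextDecoder.
      return new TextDecoder(encoding).decode(bytes);
    case "utf-32le": {
      // No TextDecoder label for UTF-32; decode code points manually
      // (assumes ptr is 4-byte aligned).
      const units = new Uint32Array(mem.buffer, ptr, byteLen >>> 2);
      let out = "";
      for (const cp of units) out += String.fromCodePoint(cp);
      return out;
    }
  }
}
```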
If it is decided that WebIDL-bindings should not provide string operations, that'd be fine, but in this case whatever is decided-upon as the alternative should take it into account (note that anything based upon TextEncoder currently doesn't).
Hope that makes sense :)
Note that JavaScript strings are not UTF-16, they're 16-bit buffers. UTF-16 has constraints that JavaScript does not impose.
Yes. In JavaScript most operations are not "unicode safe" and interpret those 16 bits as UCS-2, except String#fromCodePoint, String#codePointAt, String#toUpperCase/String#toLowerCase and several others. But UTF-16LE and UCS-2 have the same 16-bit storage, so for simplicity most people call that UTF-16 encoding.
The distinction is nonetheless important because you could imagine a language having support for UTF-16 the way Rust has support for UTF-8 (8-bit buffer with constraints) and that's not a good fit for what OP is asking for.
UCS-2 is a strict subset of UTF-16. It means if we use UCS-2 we could always reinterpret it as UTF-16 without any caveats if both have the same endianness. UTF-16 just understands surrogate pairs - UCS-2 doesn't.
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided
So I don't think it's a big deal for the current topic.
Surrogate pairs are not the issue, lone surrogates are.
Yes, sure, lone surrogates are a problem for UTF-16 and UTF-8 as well:
https://speakerdeck.com/mathiasbynens/hacking-with-unicode-in-2016?slide=106
And I guess it shouldn't be a problem for modern encoders/decoders?
My understanding of UTF-16LE here is based on this piece of information:
Most engines that I know of use UTF-16
The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16
Ultimately this issue isn't solely about the specialties of JS strings of course, hence not so much about UCS-2 as an outdated standard.
I thought that if WebAssembly implements WebIDL bindings it should follow the WebIDL spec
Maybe. As I had just said:
Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit
tl;dr does this wasm binding layer we're describing need to reason about WebIDL at its core, or is WebIDL another target with a produce/consume pair? If the latter, and we suspect that is the case, then we're free to design an IDL that better matches what we're trying to do, rather than try to retrofit that on top of WebIDL.
Full notes of the accompanying discussion here: https://github.com/WebAssembly/meetings/blob/master/2019/CG-06.md#webidl-bindings-1-2-hrs
Also, WebIDL does not specify a wire format. The second half of the sentence on DOMString:
Such sequences are commonly interpreted as UTF-16 encoded strings [RFC2781] although this is not required
Yes that's splitting hairs.
The WebIDL bindings proposal already cares about the pretty special null-terminated strings allowed only in C/C++ (utf8-cstr).
Yeah I'd been thinking that was a mistake for a while. #43
In particular it was proposed in order to show the types of bindings that could be modeled. And indeed, the discussions it spawned have shown that we would need one binding per possible encoding, which maybe 10 years from now is fine. For MVP, no.
Since those encodings are relatively similar, I'd say that the implementation isn't a significant burden, while solving the issue for most modern programming languages for good.
My issue here isn't so much that the implementation would be a burden, but one of defending against scope creep.
And I'd prefer a scheme where we do spec an entirely new binding for every new encoding, rather than parameterizing over the encoding. Parameterizing doesn't save us implementation effort, but adds some complexity to the binding spec. On that basis, I don't think we're missing any elegance by not specing UTF-16 now and adding it later.
And I'd prefer a scheme where we do spec an entirely new binding for every new encoding
+1 on this.
So, with two well-regarded voices positioned against my compromise, that'd essentially mean (as of today) there'd need to be
- utf8-str / alloc-utf8-str
- utf16-str / alloc-utf16-str
- utf32-str / alloc-utf32-str (potentially)
likely leading to the conclusion that having more than one pair of instructions initially is not in the scope of the MVP, which is a much easier point to defend.
by not specing UTF-16 now and adding it later.
That's why I suggested the compromise in the first place, since I think it is wrong to not address, for non-technical reasons, an issue that multiple languages will run into immediately even though we saw it coming.
To me personally this feels like a proper implementation of the feature is being prevented through the backdoor for the wrong reasons.
for non-technical-reasons
Whereas I view wasm-bindings in general as a technical mechanism to help resolve the already-non-technical problem of language interop anyway. For me a major guiding principle is that of facilitating coordination between mutually non-cooperating implementors. In particular, this bit mentioned in WebAssembly/design#1274:
Provide a Schelling point for inter-language interaction
This is easier said than done, but I think wasm should send a signal to all compiler writers, that the standard way to interoperate between languages is X.
Wherein the existence of some standardized mechanism for interop provides a natural target. (Schelling points are fascinating in general). The risk as I perceive it is that people are going to write code whether we provide a mechanism or not. For application developers, "we have a feature coming in ~12 months, maybe" is not something they are going to wait for. So they're going to ship something. I want it to be this, and on that basis it matters hugely whether we can ship in browsers in 2020, vs 2021.
UTF-16, alone, is not going to push us back that far. I'm worried about "but we have UTF-16, so what about..." creeping in. We see that in this thread, with the incredibly-dubious utf8-cstr used as justification for just one more specific binding.
The cruxes of the issue, for me, are:
- From the perspective of UTF-16-using languages, I do not see the difference between shipping wasm-bindings v1 in 2020, and v2-with-UTF-16 in 2021, vs adding UTF-16 to v1, but delaying v1 to 2021.
- From the perspective of the broader ecosystem, shipping v1 in 2020 vs 2021 can be a massive difference.
The bit that's much more of an open question is, if/when we inevitably add UTF-16, what is the proper mechanism?
On a technical side, I don't see the difference between
A) utf8-str + utf16-str
B) str(utf8) + str(utf16)
likely leading to the conclusion that having more than one pair of instructions initially is not in the scope of the MVP, which is a much easier point to defend.
because regardless of pairs of instructions, we still have pairs of encodings. The difficulties I see are 1) formalizing that in the spec, and 2) wiring up the host's existing decoders. Neither of those are made simpler by parameterizing over the encoding in the format. So assuming I'm right about limiting scope, we can choose for v1 whether we ship just utf8-str or just str(utf8). Having a parameterized encoding doesn't change the v1-ability of utf16 bindings.
I see two ways I can be wrong about that:
- it does reduce implementation complexity
- I'm making the wrong tradeoff of M vs V in MVP
Whether to parameterize on the encoding is more a matter of taste at that point, though I suspect it's slightly more work to spec, so I favor the separate-instr-per-encoding design on that basis. This is the weakest-held of my opinions though.
I mention all of this because if I am fatally wrong I would much rather know about it now than in 2022.
I have many more thoughts on the meta-side of this but I'm going to make this message a 2-parter for latency purposes.
The first sentence of the explainer also reads
The proposal describes adding a new mechanism to WebAssembly for reliably avoiding unnecessary overhead when calling, or being called, through a Web IDL interface
My expectation would be that "reliably avoiding unnecessary overhead" is a priority, even for an MVP, since it's literally the first sentence, whereas
I view wasm-bindings in general as a technical mechanism to help resolve the already-non-technical problem of language interop anyway
and
I think wasm should send a signal to all compiler writers, that the standard way to interoperate between languages is X.
seem like an overarching goal that does not play well with what the proposal is trying to solve in the first place, especially since we are not even talking exotic encodings here but UTF.
Regarding
UTF-16, alone, is not going to push us back that far. I'm worried about "but we have UTF-16, so what about..." creeping in
it looks like adding the set of well-known UTF encodings is sufficient for an MVP because it covers like 90% of languages, while just UTF-8 is not even close when looking at the list of languages above. Could as well name this proposal "WebIDL-bindings-for-C-and-Rust" then, as my expectation would be that the MVP of the spec remains irrelevant for something like AssemblyScript for who-knows-how-long since alternatives will still be faster.
- From the perspective of UTF-16-using languages, I do not see the difference between shipping wasm-bindings v1 in 2020, and v2-with-UTF-16 in 2021, vs adding UTF-16 to v1, but delaying v1 to 2021.
- From the perspective of the broader ecosystem, shipping v1 in 2020 vs 2021 can be a massive difference.
That's an assessment I do not share. While shipping it in the MVP does indeed involve additional work, I can't see how "slightly more work to spec" or "wiring up the host's existing decoders" would lead to delays of such magnitude.
Having a parameterized encoding doesn't change the v1-ability of utf16 bindings.
I agree on that, just wanted to point out the potential misconception that might arise here in that holding back on multiple instructions can magically seem more appropriate, even though the underlying concern remains unchanged (which I think is just what happened).
Edit: Maybe another point: If it was an (extensible) operand, implementers could opt to support UTF-16 early, but it's not that easy if it requires an entirely new instruction, with the only alternative being to wait.
Edit: Maybe one more point: I thought that the specs benefit from being used by multiple ecosystems (that worked for the WASM MVP at least), but what's proposed here is doing the exact opposite, essentially making the MVP only viable to the first-class club. That's sad, because we'd like to be involved.
Apart from that, I feel that I should mention that I appreciate your thorough comments, even though I don't agree with certain aspects. :)
To me personally this feels like a proper implementation of the feature is being prevented through the backdoor for the wrong reasons.
I for one welcome spirited debate, and hope it doesn't feel like we backdoor any of this. To that end, I've been wanting to put together an informal video/voice chat so interested parties can have more high-bandwidth discussions than github issues allow. I'll draft some sort of pre-work for that, probably tomorrow, and post it as an issue.
There will also be more opportunities to say "you're doing it wrong" when we have a prototype implementation in a browser, and you'll be able to target that, and show us with data how suboptimal it is exactly. My prediction is that AS->wasmBindings->Host will be more efficient than AS->JS->jsBindings->Host, even with an extra reencoding. There's a couple ways that measurement could turn out, and the right thing to do will depend on the in-practice data.
While shipping it in the MVP does indeed involve additional work, I can't see how "slightly more work to spec" or "wiring up the host's existing decoders" would lead to delays of such magnitude.
Not from UTF-16 alone, but on the assumption that a less hard-nosed stance on what is in scope for MVP could lead to 2x the binding operations, and that would add months of time. Bit of a slippery slope.
A non-small part of that is I personally want All The Bindings for All The Languages, but if we don't have clear launch criteria we could spend years specifying this. My strategy to get everything is to make sure we have enough room to extend the design for v2+, so we can do something now and everything later. Where that line is is negotiable, and it is likely that I am overcorrecting because I have this argument with myself on a regular basis :)
Could as well name this proposal "WebIDL-bindings-for-C-and-Rust" then,
Could probably name the MVP of WebAssembly as C-and-RustAssembly on a similar basis. By analogy, 1) we see that in practice people build things on top of wasm anyway, and 2) post-MVP wasm is becoming increasingly awkward for C and Rust to support (we don't have a good story for anyref in C for example).
I agree on that, just wanted to point out the potential misconception that might arise here in that holding back on multiple instructions can magically seem more appropriate, even though the underlying concern remains unchanged
Agree. Also by "slightly more work to spec" I mean one instr w/ two params vs two instrs, because that's three parts + a composition instead of just two parts.
Maybe another point: If it was an (extensible) operand, implementers could opt to support UTF-16 early, but it's not that easy if it requires an entirely new instruction, with the only alternative being to wait.
Disagree, any wasm compiled to that target isn't interoperable either way, and the implementation effort of having both still isn't wildly different (for either producer or consumer).
Apart from that, I feel that I should mention that I appreciate your thorough comments, even though I don't agree with certain aspects. :)
Thanks, that's good to hear :)
There will also be more opportunities to say "you're doing it wrong" when we have a prototype implementation in a browser
Having to re-encode twice in any UTF16->UTF8->UTF16 scenario (like AS calling JS APIs) already seems like enough unnecessary overhead that we should avoid it regardless of any eventual findings with a prototype. If the prototype shows that this is still faster, my conclusion wouldn't be that it's fast enough, but that the alternatives are too slow.
My prediction is that AS->wasmBindings->Host will be more efficient than AS->JS->jsBindings->Host, even with an extra reencoding
Make that two extra re-encodings. One use case I think of in this regard btw is AS code extensively calling Canvas2D APIs, which take colors, fill styles and whatnot as strings, and to me it looks like something custom, for example using a lookup array of generated ids to string refs when calling out to the host (as sketched below), will be a serious alternative to re-encoding hell if function imports are sufficiently optimized.
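For illustration, the kind of userland workaround I mean (sketch only, all names made up) would look roughly like this on the JS glue side:

```ts
const memory = new WebAssembly.Memory({ initial: 1 });
const contexts: CanvasRenderingContext2D[] = [];
const stringTable: string[] = [];

// Passed to WebAssembly.instantiate as the module's imports.
const imports = {
  env: {
    // Called rarely: pay one UTF-16LE decode, cache the resulting JS string.
    registerString(ptr: number, byteLen: number): number {
      const bytes = new Uint8Array(memory.buffer, ptr, byteLen);
      return stringTable.push(new TextDecoder("utf-16le").decode(bytes)) - 1;
    },
    // Called in the hot path: no encoding work at all, just a table lookup.
    setFillStyle(ctxId: number, strId: number): void {
      contexts[ctxId].fillStyle = stringTable[strId];
    },
  },
};
```

That's obviously not pretty, but it's the kind of thing the bindings would be competing against.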
One could even go as far as to conclude that picking UTF-8 as the initial default is at least as arbitrary as picking UTF-16 as the initial default, depending on whether one is looking at this from a producer or a consumer standpoint. Like, the spearheading runtimes for the spec will be the major browsers, and the majority of bound APIs will be JS APIs. So one could argue that picking UTF-16 would make a more reasonable initial default. Not saying that, but excluding UTF-16 from the MVP feels even more arbitrary on this background to me.
To that end, I've been wanting to put together an informal video/voice chat so interested parties can have more high-bandwidth discussions than github issues allow. I'll draft some sort of pre-work for that, probably tomorrow, and post it as an issue.
👍
I think the nuance of my comments got missed above, but to reiterate, neither UTF-8 nor UTF-16 technically allows unpaired surrogates. (They have an identical value space.) A DOMString being a 16-bit buffer does allow for representing unpaired surrogates. USVString does not. ByteString is best represented by an 8-bit buffer.
A question with either UTF-8 or UTF-16 representation is how you deal with unpaired surrogates. Map them to U+FFFD, trap, something else?
I am not sure I agree that UTF16 is a better default than UTF8.
Yeah, I mostly made that argument to underline that UTF-16 is not less important, even though I know about the importance of UTF-8 when looking at it the other way around. Ideally, both would be supported in the MVP so both perspectives are equally covered.
There is a slippery slope argument to be had; and that is why Jacob has been rightfully resisting.
I understand that keeping the MVP MV is an important aspect, and that being careful with what to include makes perfect sense. Though I think that the case of UTF-16 does not fulfill the knock-out criterion.
On the other other hand, where do we stop with strings? E.g., it is entirely possible that more applications use code pages than UTF
My suggestion would be to stop at UTF-8 and UTF-16 for the MVP, which sufficiently cover the most common encoding on the producer plus the most common encoding on the consumer side, and think about everything else in a v2.
A question with either UTF-8 or UTF-16 representation is how you deal with unpaired surrogates. Map them to U+FFFD, trap, something else?
That's a good question, yeah. My immediate thought on this is that the binding shouldn't impose any restrictions that'd invalidate likely scenarios or lead to significant overhead, and leave producing errors to the implementations that actually require it. In the best case, that's no scan at all, while in the worst it's one scan to assert additional restrictions.
That'd essentially mean that the binding wouldn't care about the interpretation of the byte data, which it most likely won't do anyway at runtime in order to be fast, with the encoding being more an indicator. Now I'm not exactly sure about the UTF8<->UTF16 case, but it looks to me that the problem is similar in both so piping through unpaired surrogates as-is, delegating any checks further down the pipeline, seems to be the most sensible thing to do.
One could even go as far as to conclude that picking UTF-8 as the initial default is at least as arbitrary as picking UTF-16 as the initial default, depending on whether one is looking at this from a producer or a consumer standpoint
That is an extremely valid point. Honestly my general point of "let's only ship one encoding in MVP" would be content with only having UTF-16. I suspect that would be controversial :D
and we may spend more time arguing about it than it would take to implement
I'm weirdly kind of ok with that being the calculus, though I wonder if that sets a dangerous precedent...
There is a slippery slope argument to be had
Thinking about this more, cstr and utf-16 are different enough in character that I think the slipperiness of the slope is less dangerous. cstr is for one language and has immediate obviously-better options available, utf-16 is used in more places, up to and including the browser itself.
That "browser itself" part is probably the most compelling, because even in a C program you may want maximum performance, and because Blink's wtfStrings are UTF-16, you will necessarily pay a single ASCII->UTF-16 encoding cost at every boundary. For that reason a C program that wanted to avoid that could use a JSString that is UTF-16 encoded, and reuse that for multiple calls without needing to re-encode each time.
For that reason I think we should probably ship with both. It's not strictly M, but it should be useful enough to warrant it. Minimality is not itself an axiomatic condition.
My immediate thought on this is that the binding shouldn't impose any restrictions that'd invalidate likely scenarios or lead to significant overhead, and leave producing errors to the implementations that actually require it.
👍
For a 16-bit buffer no validation would be required (please don't call it UTF-16 in that case), but for an 8-bit buffer you would need to define some conversion process as even if you permit WTF-8 (which allows unpaired surrogates), there'd still be invalid sequences that you'd need to handle somehow as they cannot be mapped to a 16-bit buffer (i.e., DOMString). And for USVString there are more tight requirements, which if you don't handle them via a type, will instead incur a cost at the binding layer, which isn't exactly great. So not imposing any restrictions at all does not seem like the kind of thing you'd want here.
Thanks @jgravelle-google for trying to hold the line.
A fundamental property of this proposal is that the implementation complexity is gonna be O(N^2) in the number of binding operators for each given type. So adding "just one more" is not the no-brainer it may seem.
There is a choice to be made. Either keep this mechanism Web-specific. Then it can be overloaded with all sorts of Web/JS goodies like UTF-16; respective engines are hyper-complicated already. But I doubt any non-Web engine would want to implement it.
Or make this mechanism more universal, i.e., the snowman idea. Then it is crucial for wider adoption in engines to keep it as small as possible. Putting Unicode transcoders into every core Wasm engine is not the route to go down, and misses the point of Wasm.
All The Bindings for All The Languages
That is completely unrealistic and cannot be the goal. Rule of thumb: there are (at least) as many data representations as there are languages. There is no canonical set. It would literally mean hundreds of language-specific binding operators (remember: N^2 complexity) -- all baked into a code format that supposedly was low-level and language-independent.
So, from those two comments, I'd take away that asserting the well-formedness of either or both UTF-8 and UTF-16 would require at least a validation scan on every boundary (since the producer might be doing something wrong), leading to the binding having to do significantly more work, in turn requiring Wasm engines to ship significantly more code, which contradicts the purpose of this proposal.
Hence I suggest to add to the spec that the binding does not ensure well-formedness of the encoding for those reasons, and that either
- if a consumer requires inputs to be well-formed, it must assert this condition on its own where necessary or take the respective measures to deal with ill-formed sequences in a more general way.
or
- if a producer wants to ensure that such a string is understood by a consumer, it should take the respective measures for the exact API call in particular, since it is perfectly possible that a producer is dealing with multiple consumers with different requirements.
Not sure which of the provided alternatives is best. The first doesn't restrict producers, the second doesn't restrict consumers. From a solely "avoid unnecessary overhead" perspective, it looks like the second might be more straightforward because it requires less general defenses.
To be more concrete, if a potentially ill-formed UTF16->UTF8 conversion is taking place, the unpaired surrogate should become an ill-formed single code point representing its value (as three bytes), essentially piping through ill-formedness. Likewise, if a potentially ill-formed UTF8->UTF16 conversion is taking place, the respective code point should become an unpaired surrogate again, essentially piping through ill-formedness.
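A sketch of the WTF-16 -> WTF-8 direction, just to make the above concrete (illustrative only, not meant as spec text):

```ts
// Potentially ill-formed UTF-16 (WTF-16) code units -> WTF-8 bytes.
// Valid surrogate pairs combine into a 4-byte sequence; a lone surrogate is
// simply emitted as the ordinary 3-byte sequence for its code point, i.e.
// ill-formedness is piped through instead of being replaced or trapped on.
function wtf16ToWtf8(units: Uint16Array): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < units.length; i++) {
    let cp = units[i];
    const next = i + 1 < units.length ? units[i + 1] : 0;
    if (cp >= 0xd800 && cp <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
      i++;
    }
    if (cp <= 0x7f) out.push(cp);
    else if (cp <= 0x7ff) out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    else if (cp <= 0xffff)
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    else
      out.push(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
               0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
  }
  return new Uint8Array(out);
}
```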
In general, the WTF-8 encoding seems to fit this well, since it has been created in the presence of this relatively common scenario (if I'm not missing something they do differently from what I've written above).
WTF-8 (Wobbly Transformation Format - 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don't enforce the well-formedness invariant that surrogates must be paired.
https://simonsapin.github.io/wtf-8/
WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16
https://simonsapin.github.io/wtf-8/#ill-formed-utf-16
WTF-8 (Wobbly Transformation Format - 8-bit) is an extension of UTF-8 where the encodings of unpaired surrogate halves (U+D800 through U+DFFF) are allowed. This is necessary to store possibly-invalid UTF-16, such as Windows filenames. Many systems that deal with UTF-8 work this way without considering it a different encoding, as it is simpler.
https://en.wikipedia.org/wiki/UTF-8#WTF-8
It appears that this is the least common denominator here. I have no strong opinion on additional encodings, but do somewhat agree with rossberg. Yet, the two encodings mentioned here appear necessary in JS/Browser<->JS/Browser (here: not only browsers but anything that does it the JS-way) and C/Native<->C/Native scenarios, which are by far the most likely ones, while also allowing both to talk to each other and giving any other use case the option to pick the one that fits its use case best.
If I'm missing something, please point out what it is :)
A fundamental property of this proposal is that the implementation complexity is gonna be O(N^2) in the number of binding operators for each given type.
Strong disagree. We only need N^2 work if each pair of bindings needs a unique implementation. If we don't specify the runtime characteristics, then a reasonable implementation might look like: Translate from A.wasm to some engine-specific IR (O(N) implementation effort) + engine IR to B.wasm (O(N) implementation effort). This then leaves room to optimize a subset of end-to-end bindings, e.g. we could then specially optimize UTF-16 -> UTF-16 to be a memcpy, but not UTF-8 -> UTF-8 (or vice-versa), which implies O(1) additional work that's separable from a minimal implementation of the standard.
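To spell out the shape of that argument (a sketch of the idea only, nothing engine-specific): N producer-side decoders plus M consumer-side encoders composed through one internal intermediate is O(N+M), and fast paths for specific pairs are opt-in on top of that.

```ts
type Encoding = "utf-8" | "utf-16le";

// O(N): one decoder per producer-side encoding into the intermediate (here, a JS string).
const decoders: Record<Encoding, (b: Uint8Array) => string> = {
  "utf-8": (b) => new TextDecoder("utf-8").decode(b),
  "utf-16le": (b) => new TextDecoder("utf-16le").decode(b),
};

// O(M): one encoder per consumer-side encoding out of the intermediate.
const encoders: Record<Encoding, (s: string) => Uint8Array> = {
  "utf-8": (s) => new TextEncoder().encode(s),
  "utf-16le": (s) => {
    const units = new Uint16Array(s.length); // assumes a little-endian host
    for (let i = 0; i < s.length; i++) units[i] = s.charCodeAt(i);
    return new Uint8Array(units.buffer);
  },
};

function passString(src: Encoding, dst: Encoding, bytes: Uint8Array): Uint8Array {
  // Optional per-pair fast path an engine may choose to add (e.g. utf-16 -> utf-16).
  if (src === dst) return bytes.slice(); // effectively a memcpy
  // Generic N+M path: decode into the intermediate, re-encode for the consumer.
  return encoders[dst](decoders[src](bytes));
}
```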
Or make this mechanism more universal, i.e., the snowman idea.
I believe that's what I described just here, if there's more missing there let me know. The key is whether the central snowman types are conceptual or reified, and how much wiggle room we leave in the spec.
That is completely unrealistic and cannot be the goal. Rule of thumb: there are (at least) as many data representations as there are languages. There is no canonical set. It would literally mean hundreds of language-specific binding operators (remember: N^2 complexity) -- all baked into a code format that supposedly was low-level and language-independent.
Just because I want something doesn't mean I think it's realistic.
Though the broader point is that I don't believe we need completely seamless bindings to support all the languages well-enough to be worth doing. The mapping doesn't need to be total. I think of it as being similar to a Voronoi diagram, in that we just need a set of points that are reasonably close to existing languages.
For example, if we only had UTF-16, C would need to re-encode from a char* into a wchar*. But if we only had cons-list-of-gc-chars, C would need special support in the compiler to be able to handle that. I don't expect every language to map perfectly, but not all imperfect mappings are equal.
We only need N^2 work if each pair of bindings needs a unique implementation.
To actually have any benefit from additional binding types the engine will need a specialised implementation for them, or am I missing something? What's the point of adding them when they do not optimise at least a few paths? And if they do, that puts you at O(N^2) (note the O, though).
Preface: this got rather long and general. About half of the detail/pedantry here isn't meant to be directly confrontational, but more as a response to getting these questions a bunch, and actually putting my thoughts in a public forum.
To actually have any benefit from additional binding types the engine will need a specialised implementations for it, or am I missing something?
For any performance benefit, and even then only kinda. There are three tiers of speed:
- needing to go from Wasm->JS, through some amount of JS conversion glue code (creating JS strings from memory, table management for references), then from JS->Host
- being on-par with JS, being able to go Wasm->Host roughly as quickly as JS->Host
- being able to eliminate almost all conversions, making Wasm->Host calls on par with Host->Host calls
(Note that "Host" here could also, and almost surely does, mean "another Wasm module". If Host was always the embedder, this would be O(N) effort in all cases)
Today we have 1). I believe that even with an O(N+M) non-specialized solution, we should get within a sub-2x factor of 2); we may have an additional conversion, but can avoid the dynamic checks of JS, which also needs to convert once. I'm not actually sure we can reach 3), and surely not for all N^2 combinations.
My performance goal is to get us from 1) to 2), and for that I believe an O(N+M) strategy is sufficient.
Further, my non-performance goal is to have an inter-module communication channel that does better than C FFI. To that end, more binding types offers more flexibility for producers, and offers more value on the "better than C FFI" front. On one extreme, we could model bindings as being isomorphic to a C ABI, but at that point we have done nothing to improve the state of the world in that dimension. And I believe that facilitating an ecosystem of intercommunicating, distributed, mutually-untrusting Wasm modules is more important than ensuring a performance characteristic of 3) at the binding layer.
at least a few paths? And if they do, that puts you at O(N^2)
Depends on how you define "a few". If a few means "a constant x% of all possible pairs", then yes. If a few means "these three specific pairs we care about", then no. My understanding is that engines optimize based on usage, and that usage patterns tend to follow Zipf's Law, making the effort O(N+M) for simple bindings + O(log(N*M)) for optimized.
But, more crucially, the degree of implementation effort becomes a choice for the implementors. If we mandate that all bindings must be equally fast, (aside from being impossible) then we require O(N^2) work. If we simply state that all bindings must function, then engines have more leeway in how much surface area requires optimization, and can gradually increase (or decrease!) the amount over time.
@jgravelle-google, I'm having trouble making out whether you are talking about webidl bindings or the generalised snowman bindings idea. Because going through JS seems to implicitly assume the former.
I'm fine with putting all sorts of ad-hoc complexity into webidl bindings, since they target JS and the Web, which are concrete and hyper-complicated ad-hoc beasts already. Go wild, I don't mind!
I'm only arguing about snowman. If we want to abstract away from webidl then we'd neither want to specialise for particular languages nor for particular host environments. Your points (1) and (2) are not even applicable in that setting. Moreover, (3) is fundamentally impossible when module and host don't share the same representations. So from that perspective, we should avoid picking arbitrary points of comparison and focus on simplicity and generality.
But, more crucially, the degree of implementation effort becomes a choice for the implementors.
The problem I see with making that a selling point is that it doesn't mesh well with one of Wasm's basic goals: predictable performance.
Exploring the "simplicity" and "generality" roads a bit further:
- Simplicity: Dictating a specific encoding leads to those pesky re-encode twice scenarios if both producer and consumer don't match it.
- Generality: Making strings a (buffer, encoding) pair with the consumer deciding how to deal with it means that every single consumer has to ship all the necessary encoders.
As such, I can't see how snowman bindings aren't affected as well, e.g., C (or Rust) talking to .NET (or AssemblyScript), independently of whether one of these is the host. To me it seems that this requires a reasonable compromise anyway, with the foremost thing to generalize being to make no distinction between a client and a host (but that'd mean WebIDL == snowman).
I'm only arguing about snowman.
Same. I don't remember whether I was using the Web as a specific example of an embedder with an encoding opinion, or trying to tell a story of how the performance we could expect in general would be roughly tier (2).
(but that'd mean WebIDL == snowman)
It does/should. Rather, WebIDL ⊂ snowman. Snowman is the only bindings, it just needs to be sufficient to describe WebIDL APIs.
To both points, my note here is that this needs to work adequately in the browser, though the web need not be the only embedder that's relevant.
So from that perspective, we should avoid picking arbitrary points of comparison and focus on simplicity and generality.
Yeah. I generally think using concrete examples can be illustrative though. The principles of simplicity and generality are good starting points, but the picture isn't complete until we can map it down to reasoning about consequences for implementation.
The problem I see with making that a selling point is that it doesn't mesh well with one of Wasm's basic goals: predictable performance.
I assuage this bit of cognitive dissonance by saying 1) this isn't a core-wasm-language feature (it's a normative section to be sure, but it doesn't impact "normal" wasm codegen), and 2) it's analogous to an RPC call, which doesn't have predictable performance anyway. And one can imagine a variety of physically-distributed systems where these bindings describe actual RPC calls, and there's no way to enforce performance in those cases.
Thinking further, bindings are really analogous to wasm imports as we have them today. It is currently the case that one cannot reason about performance across an import boundary, because at that point one is exiting the scope of the wasm module itself. To speak less-abstractly and more web-specifically, wasm<->wasm calls can be cheaper than wasm<->JS calls, because one of them invokes a conversion (toJSVal) absent in the other. And even then the performance overhead isn't predictable because different browsers have different performance characteristics here.
For bindings, I think the most predictable we can be is to give bounds, e.g. "at worst, this will involve N copies and M re-encodings", where N=2 and M=2. The only ways to give tighter bounds are giving up generality, imposing N^2 cost on consumers, or moving the copies+encodings into user code. Given that, performance nondeterminism seems like a lesser evil.
WebIDL ⊂ snowman
Thanks for the clarification, that wasn't at all obvious to me. Previously I only feared snowman feature creep optimising for a privileged set of languages running on the inside. Now I realise that there also is the dual problem of feature creep optimising for a privileged set of host environments on the outside. Can't say that I'm worried less now. ;)
I'm sorry if this comes across as negativity, but I'm really scared of the ultra-slippery slope into over-fitting and unbounded complexity.
I'm sorry if this comes across as negativity, but I'm really scared of the ultra-slippery slope into over-fitting and unbounded complexity.
Yeah I think it's extremely important to be concerned about these things. In particular, this axis of extensibility is the place where scope creep can occur, and it's imperative that we set the right philosophy for both short and long term.
I think the philosophy of "one binding per wasm/host feature" is a good way to keep the scope strongly limited, while still providing an accessible target for most languages. For example, a sequence of T can be represented by a pointer+count pair for linear memory modules, and an array of references for gc modules. So Scheme needs to map from a cons list to a linear array, which is a little awkward, but less awkward than having to introduce a memory to write into.
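For the Scheme case, that mapping is about as small as it gets (sketch only, hypothetical names):

```ts
// Flattening a cons list into the contiguous "array of references" /
// pointer+count shape a binding would expect for a sequence of T.
type Cons<T> = { head: T; tail: Cons<T> | null };

function consToArray<T>(list: Cons<T> | null): T[] {
  const out: T[] = [];
  for (let node = list; node !== null; node = node.tail) out.push(node.head);
  return out;
}
```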
For host features it gets interesting, and that I believe is the main focus of this specific discussion: should we support multiple string encodings, and if not, which ones do we support? From a complexity standpoint, we should only support one encoding. From a performance standpoint, we should support all reasonable encodings. My intuition has been saying that complexity is more important than performance, and we want to be very careful to avoid specifying too much. However, the real answer should not be dogmatic, and hinges on the question of "how much complexity would multiple string encodings actually impose?", which I think we need more data to give a meaningful answer.
So, I say we start with one (arbitrarily-chosen) string encoding on the host's end for simplicity of prototyping and spec-ing, and then expand as necessary, possibly before we ratify the MVP proposal.
Character encodings are something that programmers frequently get wrong, usually because they don't understand them fully.
A decent design here would be for string operations to always operate in code point space, and abstract the UTF-8/UTF-16 difference away. For example, let the Wasm runtime decide which encoding to use, and it can pick the one most appropriate for the host environment.
Operations like iterating over the string would work fine with this model. Rust does not support random-access indexing because it's an O(N) operation in code point space, and I would expect a WebAssembly string interface to do the same.
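As a sketch of what "operate in code point space" buys you (illustrative only, not a proposed API), the same iteration protocol can sit on top of either backing store:

```ts
// Code points out of a UTF-16 code unit buffer.
function* codePointsFromUtf16(units: Uint16Array): Generator<number> {
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    const next = i + 1 < units.length ? units[i + 1] : 0;
    if (u >= 0xd800 && u <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      yield 0x10000 + ((u - 0xd800) << 10) + (next - 0xdc00);
      i++;
    } else {
      yield u;
    }
  }
}

// Code points out of a UTF-8 byte buffer, leaning on TextDecoder plus the
// JS string iterator (which already iterates by code point).
function* codePointsFromUtf8(bytes: Uint8Array): Generator<number> {
  for (const ch of new TextDecoder("utf-8").decode(bytes)) {
    yield ch.codePointAt(0)!;
  }
}
```

A consumer written against the iterator can't tell which encoding the runtime picked; only random access by index would expose the difference.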
Thanks. I think in general, the world I would like to see is one where I write once, and as long as I use general-enough operations, I can run my code natively in either a UTF-8 or UTF-16 environment. This could happen by, for example:
- The source language (e.g. Rust) can have a compiler flag that makes two separate UTF-8 and UTF-16 .wasm files. It would pass the desired configuration to the proposed string-to-memory, and output different assembly instructions to adjust for the necessary offsets.
- The WebAssembly spec provides a fixed set of operations that work on string types, and the runtime decides whether to evaluate them in UTF-8 or UTF-16 mode.
If the second choice is something the WebAssembly designers also have in mind, that may influence the decisions on this interface proposal.
If there's some other discussion on this topic elsewhere, pointers would be appreciated.
I think supporting UTF-16 is important, given just how enormous of a presence it has. There's the obvious case of JavaScript, but then there are also things like Qt's new WebAssembly backend. I don't think only supporting UTF-8 will encourage any software to switch; it will just introduce conversion overhead, minor as it may be.
I don't think that interface types are the place to push UTF-8. Ultimately it's up to the languages, not the bytecode. No one will choose which encoding to use based on what interface types do; they'll choose based on what the languages involved do. Supporting more Unicode encodings means removing redundant copies, it's as simple as that. It won't lead to some dystopia where everyone switches to UTF-16 because it wasn't barred up enough.
Supporting more Unicode encodings means removing redundant copies, it's as simple as that.
As a rule of thumb, it's never as simple as that.
In particular here, the number of conversions between encodings that need to Just Work is n^2. For n == 2 (just UTF-8 and UTF-16), that's neither unmanageable nor unbounded. So that's 4x the amount of work to spec and implement string encoding conversions. In a world where, today, we have 0 Interface Types support in browsers, I think a reasonable strategy is to start with one encoding, while keeping in mind how we can expand in the future.
We don't need to get this perfectly right from day one, and having redundant copies that we can improve in future revisions is the kind of imperfection we should be aiming for. Interface Types is a massive enough proposal as it is, and being ruthless about deferring things to post-MVP features wherever possible is an important mechanism for limiting scope without hurting functionality in the long term.
To be clear, there is a failure mode here: if string-to-string re-encoding is unacceptably slow, there is a risk that people will expose a UTF16Buffer type as an array of chars, sidestepping the ABI-hiding benefits IT is supposed to be providing. In the fullness of time we do need to come up with a way to handle this well.
In particular here, the number of conversions between encodings that need to Just Work is n^2.
Yeah, I wouldn't propose supporting non-Unicode encodings. Most of them are on their last legs and it wouldn't make sense to build them into a brand-new standard. I would just want conversion between the official UTFs, which requires no lookup tables, only bit fiddling. If you're dealing with non-Unicode text, in my opinion you should be using some other non-generic string type anyway.
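To back up the "only bit fiddling" claim, here is a minimal (and deliberately lenient) UTF-16 to UTF-8 sketch; note that it passes lone surrogates through WTF-8-style, whereas a strict UTF-8 encoder would have to trap or substitute U+FFFD instead:

```ts
// Convert UTF-16 code units to UTF-8 bytes using only shifts and masks.
function utf16ToUtf8(units: Uint16Array): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < units.length; i++) {
    let cp = units[i];
    // Combine a high/low surrogate pair into one code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < units.length) {
      const lo = units[i + 1];
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++;
      }
    }
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // Lone surrogates fall through here unchanged (WTF-8); strict UTF-8 would reject them.
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    }
  }
  return Uint8Array.from(out);
}
```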
We don't need to get this perfectly right from day one, and having redundant copies that we can improve in future revisions is the kind of imperfection we should be aiming for.
If UTF-16 is left out just to get the MVP out, I'm absolutely fine with that. It's not urgent, and it can always wait. I would just be disappointed if UTF-16 were intentionally left unsupported just because of the "UTF-8 Everywhere" concept. I absolutely support UTF-8 Everywhere, but my particular version of it applies more to programming languages and file formats than to plumbing that needs to connect already-existing pieces that have already set the encoding in stone.
Thanks for your response on this. No matter what happens, interface types will be very useful.
It's completely understandable that we shouldn't spec a 3rd or 4th encoding just yet, and nobody is suggesting that here. However, ignoring UTF-16, which is very likely the second most commonly used encoding in languages currently targeting WebAssembly (including the browser/JavaScript), for the wrong reasons will potentially lock out more than half of the language ecosystem from this proposal (not because everyone is using UTF-16, but because not everyone is using UTF-8). How isn't that worth avoiding, considering that the worst case scenario is that a considerable number of languages might opt to sidestep this proposal?
My main point with n^2 is that "just support 2" is 4x as much surface area as just 1. I don't have as much of a preference of 8 vs 16, so someone else will need to decide which one we go with first. I could see 16 being used because that's what JS is already doing, or I could see 8 because C++ and Rust.
It's completely understandable that we shouldn't spec a 3rd or 4th encoding just yet, and nobody is suggesting that here.
Right, I'm not saying anyone is. I'm saying that 2 is 1 encoding too many for the state of the proposal as it is today. I'm saying that supporting more than 1 encoding is more of a nice-to-have than an absolute baseline minimal necessity that we cannot possibly succeed without. 1 is definitely required. Today we're shipping 0. I want to get us from 0 to 1 as soon as is feasible. 2 is a bit better than 1, but 1 is much better than 0.
for the wrong reasons
What are the wrong reasons? Can you be more specific? In particular what about the reasoning I've laid out is objectionable? I assume we have a mismatch in priorities. My goal is to build an ecosystem that is maximally interoperable. To that end, the feature not having shipped is a bigger problem than some languages being incentivized to ignore the feature due to performance.
That reasoning about n^2 and 2 is more than 1 and that's better than 0 is just so generic, and imho very weak compared to other valid points brought up in this issue. Of course shipping something in the first place should be the priority. I'm just trying to raise awareness that shipping something potentially useless for a significant number of languages is a high price to pay for being right on paper. Which I'm not sure that you are, btw, since what we are talking about is a detail of the broader specification, that must be specced in an extensible form anyway. So yeah, I'm questioning the theoretical reasons while advocating for an approach that considers practical aspects as well.
a high price to pay for being right on paper. Which I'm not sure that you are, btw
[...]
So yeah, I'm questioning the theoretical reasons while advocating for an approach that considers practical aspects as well.
I'm curious why you call what I'm saying theoretical reasons, when I'm talking solely about how much effort this will be to spec and implement. I'm not saying it's more theoretically pure to only have 1, I'm saying it's purely practical. Given finite engineering budgets, I would rather cut scope here than ship without, say, variant types.
shipping something potentially useless
It's not useless, it just doesn't address the 0-copy use case. I care more about "X and Y language can communicate without coordinating beyond IT" than "X and Y language can communicate with no extraneous copies". I care about the latter, indeed I think it's essential before we can declare success, but it's the sort of progressive enhancement we can put off for now.
Imagine a hypothetical. Think about it from a different language's point of view, if we were to go with only UTF-16. In that world, how much cost is worth it to support UTF-8 as well? What features are worth dropping in favor of other languages being able to pass strings with a minimum of copying? My position stays the same in that world, first-class UTF-8 support is not more important than variants, is not more important than callbacks. Obviously we'll want to support UTF8 eventually, because there's massive languages like C++ and Rust that will benefit. But until we ship something, nobody benefits.
In particular, I'm not saying 2 string encodings is definitively certainly post-MVP. I'm saying we do not have the data needed to make an informed assessment. In order to get that data, we need to spec something, implement it in engines, and then implement it in as many toolchains as is feasible. There is a cost-benefit analysis to be done. It is entirely possible that the costs of only having 1 encoding is too great to pay, and that increases the benefit of implementing 2. But until we can get that data, implementing >1 is an exercise in speculation. YAGNI and all that.
From a principled perspective, I believe that Interface Types is too enormous in potential scope for anything other than an iterative development strategy to be feasible.
Obviously we'll want to support UTF8 eventually, because there's massive languages like C++ and Rust that will benefit
Refreshingly honest, and I agree that supporting C++ and Rust is important. Still, whether that's more important than supporting the native encoding of web APIs is debatable. Not trying to make a point for either of those, but for both.
if we were to go with only UTF-16. In that world, how much cost is worth it to support UTF-8 as well?
Well, what's the worth of supporting C++ and Rust in this case? Again, seems equally important to me.
Given finite engineering budgets, I would rather cut scope here than ship without, say, variant types.
What features are worth dropping
I don't see how adding a second encoding inevitably implies dropping anything, considering that
we do not have the data needed to make an informed assessment
Sure, it is more work, and your time is valuable. I understand that, so being careful not to make any promises is perfectly reasonable. But not even considering it, just an "I don't say no, but I mean no" for one questionable reason after another, makes me wonder. If that's the way it's done, then we can as well stop asking for input and let the big players do their thing.
Still, whether that's more important than supporting the native encoding of web APIs is debatable
That is a different argument though. Again, I don't particularly care whether we go with UTF8 or UTF16, so I'm not the one to convince on that front. I'm just in the camp that we pick one and only one until we get performance data that says it's enough of a problem to be worth solving.
But not even considering it, just an "I don't say no, but I mean no" for one questionable reason after another, makes me wonder.
I am considering. I can't say "yes" or "no" without a real benchmark that would be improved. I've outlined the criteria I need to see to come to an opinion. Deciding "we definitely need this" or "this is definitely unimportant" without any concrete numbers is premature. The best we can come to is "we definitely need to consider it," and trying to conclude anything further is aimless.
What have I said to make you think I'm not understanding your concern? What have I said that came across as "no, and that's final"? We're not going to think about multiple encodings until later on, but I'm positive this issue will come up when we start getting implementations in multiple languages. For that we really will need your input as to what works in practice vs not.
If that's the way it's done, then we can as well stop asking for input and let the big players do their thing.
For clarity, I don't have any authority here other than being occasionally persuasive. I'm also working exclusively on the toolchain side of things, so it's not even any work for me personally on the engine side of things.
I don't see how adding a second encoding inevitably implies dropping anything,
It's more that every feature added necessarily delays the schedule. The choice is between shipping a v1 in 2021 and a v2 in 2022, vs shipping nothing in 2021 and v2 in 2022. I'm arguing for the former. No matter how important v2 features are, they aren't more important than v1 features. My model here is that one string encoding is v1, a second is v2. Data can convince me that a second encoding is actually v1, but at this point in time the performance overheads are not a problem anyone is having. Not having Interface Types at all, is.
until we get performance data that says it's enough of a problem to be worth solving.
at this point in time the performance overheads are not a problem anyone is having
Having to re-encode not once but twice (UTF16->UTF8->UTF16) to call a web API from AssemblyScript or .NET or whatnot, for instance to set the contents of a lot of divs in an SPA, is a foreseeable problem, though, and not addressing that because "we did not have performance data" would be completely beyond me.
The choice is between shipping a v1 in 2021 and a v2 in 2022, vs shipping nothing in 2021 and v2 in 2022
Not having Interface Types at all, is.
It is, but this also has the potential to delay everything not UTF-8 even longer since we'll have to go through the set up and tear down of two specification processes, so make that rather 2023. Not necessarily the right thing to do, considering that the v1 implementation will be so bad for some that nobody in their right mind would use it for anything half-way performance critical. How do you plan to get feedback from that part of the spectrum, if it's not worth the effort to implement the proposal in the first place? And what do the v1 languages talk to? This is just broken by design.
I was under the impression that the MVP already covers all C++/Rust needs, and that most of the proposals like GC, tail calls, and function references are more focused on the remaining managed and functional languages. And interface types should be a bridge not only for efficient interop between host and WebAssembly but also for efficient interop between different languages. In this case, we again have a situation where everything centers on C++ and Rust interop, and all the rest of the languages are again excluded from progress.
I see the point that JGravelle is making. Trust benchmarks, not intuition. I wouldn't say it's necessarily 100% foreseeable that this will result in an unacceptable performance hit; the truth is that we don't know yet. Mind you, now that I think about it, for most of the early 2000s most Windows software passed strings to the OS in Windows codepages, which the OS then had to convert to UTF-16 every single time, and the performance hit wasn't even something that most people noticed. Even now a lot of Japanese software uses Windows-932 (Microsoft's nonstandard variant of Shift-JIS) instead of any sort of Unicode encoding. This is actually a much more expensive conversion than converting between UTF-8 and UTF-16 since it involves lookup tables and not just bit fiddling, meaning that it involves lots of memory accesses and bad cache locality. On the other hand, strings passed to the OS are mostly short things like paths, so it could be that the situation is different here. All the more reason to benchmark instead of speculate.
for most of the early 2000s most Windows software passed strings to the OS in Windows codepages, which the OS then had to convert to UTF-16 every single time
But didn't they eventually fix that for... reasons?
a lot of Japanese software uses Windows-932 (Microsoft's nonstandard variant of Shift-JIS) instead of any sort of Unicode encoding
Are some developers of these programs thinking about changing it because... reasons?
All the more reason to benchmark instead of speculate.
I'm not speculating. This is objectively bad (I have implemented encoding stuff like that more often than I'd like for... reasons), and just because other software was or is happily wasting CPU cycles for convenience, calling it a day instead of learning from it is something I cannot understand.
Perhaps just imagine, hypothetically, that GC would implement string fields as interface types strings. "Read a field?" "Sure, but pipe it through the CPU a couple times each time you do." How is that not obviously a bad idea? Then, imagine someone has a use case that is not this hypothetical GC, but purely interface types, where a string is read equally frequently. How is that less bad?
In this case, we again have a situation where everything centers on C++ and Rust interop, and all the rest of the languages are again excluded from progress.
And that comes on top of it. I'm trying not to stress this point too much, though it sometimes shines through in my comments, but there are not only technical issues with biasing this proposal towards C++/Rust. It essentially tells the rest of the ecosystem that they are not important enough, even though these languages might rely on this proposal more than C++/Rust do, due to being managed first, systems second, where most of their interop is with then-horribly-inefficient Web APIs. To some this might seem like a non-inclusive, crooked specification process, which is something we as a community should strive to avoid if we want this proposal or WebAssembly in general to live up to its full potential. Don't ignore the voices, embrace them.
Looking through the working notes, it seems that something like the hypothetical GC scenario outlined above is closer than initially thought.
Also facepalmed when I saw string.size not supporting UTF-16 right away.
Q: "Determine the length of a string obtained from a web API before..."
A: "Sure, allocate, copy it to memory, compute its length with a non-native algorithm and free it again. Then do something with it."
Q: "Store it in native encoding?"
A: "Well, allocate, copy it to memory, compute its length with a non-native algorithm, allocate another time, re-encode, free one..."
Q: "γ½(γ_Β°)γ"
A: "Oh, and also don't forget that it has been re-encoded once before becoming that string in the first place."
Q: "(β―Β°ΠΒ°)β―οΈ΅/(.β‘ . )"
Not even exaggerated, since it is unlikely that anyone will optimize for this case considering that "UTF16 may be added in the future". Unbelievable how this can be considered MVP.
Some implementation notes:
I have been prototyping Wasm calls to the JavaScript binding code in Chrome (without calling through imported JavaScript glue code.) This is the "Host Bindings" scenario.
My prototype converts ASCII strings as (i32 ptr, i32 length) pairs to new JavaScript String objects which are passed to Chrome where they are converted to a C++ string object (Both V8 and Blink, the HTML rendering engine in Chrome, support 1 and 2 byte strings internally).
The prototype can run an HTML Canvas animation demo which makes many calls that pass short strings (setting fill and stroke style to colors passed as short ASCII strings, and passing a fill rule (enum value "nonzero"). You would think performance would take a hit from creating so many strings just to pass parameters, but there is no measurable impact in the demo.
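For readers less familiar with the current state of things, this is roughly the kind of JS glue such a prototype replaces; the helper below is illustrative, not the prototype's actual code:

```ts
// Lift an (i32 ptr, i32 length) pair out of linear memory into a JS string.
// ASCII is a subset of UTF-8, so TextDecoder handles the demo's short strings.
const utf8Decoder = new TextDecoder();

function liftString(memory: WebAssembly.Memory, ptr: number, length: number): string {
  const bytes = new Uint8Array(memory.buffer, ptr, length);
  return utf8Decoder.decode(bytes); // copies and decodes into a new JS string
}

// e.g. ctx.fillStyle = liftString(memory, colorPtr, colorLen);
```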
Strings are performance critical in browsers, so there is a lot of machinery to make them fast. Both V8 and Chrome also support the concept of internalized / atomic strings, which can be very fast when the set of actual strings in use is bounded.
Thanks, these are interesting insights. I'd like to add to that, though, that one might not want or be able to ship all this machinery with a WebAssembly module to cover the other side of the equation. Like, the inverse scenario is calling a web API that returns a string, passing it to the module, decoding it into its memory, then re-encoding it to another chunk of memory so the module can use it in its native representation. As such, your scenario seems to favor a very specific kind of problem that just so happens to be half-way efficient. As a counter example, take a Markdown parser that receives strings from a Node process doing the I/O, parses them, and returns HTML. Make that a service doing this over and over again. Use cases involving JSON are similar.
But didn't they eventually fix that for... reasons?
As in Microsoft? They didn't change anything. Windows still works exactly that way. If your software doesn't pass strings to Windows in UTF-16, it still does a full-on conversion. Or do you mean developers of programs switching over to use UTF-16 directly? Many have done that, but I've never seen performance as one of the motivations to do so. Rather, I've only ever seen it framed in terms of supporting text properly. I haven't heard any real mention of performance. Plus a lot of software these days is moving away from UTF-16 internally to UTF-8 instead, because of the effort to standardize around it, which means that they end up converting every string before they pass it to Windows.
Are some developers of these programs thinking about changing it because... reasons?
In Japan? Uh... even in 2020 Windows-932 is neck-and-neck with Unicode in Japan when it comes to Windows software. Unless a program is meant for export to other countries, I would be surprised to see it use Unicode at all. And just like in other places, developers who are switching are doing so to avoid mojibake and unsupported characters, not for performance.
Now, let me point out that I think I'm on your side when it comes to this, DcodeIO. I would like to see no-copy UTF-16 support. If it never gets added, I will see this as a failure of the design. But what I'm saying is that this is a feature that a lot of highly successful software lacks. When it comes to the minimum viable product, interface types will still be a viable choice in the meantime.
Conversion overhead likely is minimal when dealing with short strings, but when dealing with longer strings, conversion takes not only time but also memory. For OmnICU, we'd like the ability for WASM to efficiently perform reads into a host-side UTF-16 string without having to copy the string into a WASM buffer.
Now, let me point out that I think I'm on your side when it comes to this, DcodeIO. I would like to see no-copy UTF-16 support. If it never gets added, I will see this as a failure of the design.
Yeah, and I'm extending this to the v1, in that it would be a failure of design for a too-large part of the ecosystem until there is a v2, so it should be part of a v1 to prevent that failure from existing for years in between. Heck, what if nobody steps up to drive this forward after a v1, because their C++/Rust needs are covered or their engineering budget is exhausted, as was hinted? Strings are just so crucial on the web that spending some additional time (which I still think is not overly much, considering that this has to be designed to be extensible in the first place) to get this right for both C++/Rust and anything functioning like Web APIs will pay off both technically and from an ecosystem point of view. Both cases appear equally important, and most of us agree, so why is there still stuff like "UTF-16 may be supported in the future" in this proposal / the working notes?
The proposal itself even states "One of the primary motivations of this proposal is to allow efficient calls to Web APIs" (emphasis mine). But Web API(-like) calls are exactly where it falls short. If one targets Web APIs, one should rather design it after Web APIs, not after C++/Rust APIs - but again, I think both are important. It even explicitly states "without committing to a single memory representation or sharing scheme"...
But to be not exclusively negative here: I certainly appreciate that you guys are taking the time to discuss, even though I have apparently evolved to the unfriendliest person in the Wasm world.
Read through a couple more working notes, and decided to give a variation of Counting Unicodes a shot, assuming UTF-16. Didn't fill out all of it yet.
Counting Unicodes (UTF-16)
Preamble
This note is a variation of unicode_count.md, but assuming that the common encoding of linked modules is UTF-16, i.e. where the least common denominator is a browser and other modules adapt. It is part of a series of notes that gives complete end-to-end examples of using Interface Types to export and import functions.
Introduction
Importing Unicode Counter
Exporting a Unicode Counter
Shared Nothing Linking
Combining Import and Export Adapters
Summary
The countCodes function is fairly complex (one will definitely think twice before filling out the paragraphs above); however, not filling these out illustrates many of the problems arising from basing the initially supported encoding scheme on foreign APIs instead of Web APIs.
Using UTF-16 instead of UTF-8 as the initial encoding solves many of these problems, except that the proposal then cannot simultaneously aim at minimizing the overhead of linking C++/Rust WebAssembly modules, which is not its primary motivation, unless UTF-8 is supported as a second encoding.
[Counting Unicodes]: https://github.com/WebAssembly/interface-types/blob/master/proposals/interface-types/working-notes/scenarios/unicode_count.md
[STRING i32 pairs for UTF-16LE]: #13
Now, let me point out that I think I'm on your side when it comes to this, DcodeIO. I would like to see no-copy UTF-16 support. If it never gets added, I will see this as a failure of the design. But what I'm saying is that this is a feature that a lot of highly successful software lacks. When it comes to the minimum viable product, interface types will still be a viable choice in the meantime.
Basically that, yeah. If it's an obvious 10x slowdown on each binding, but binding code is only 0.1% of the overall application, a 1% performance improvement is not the lowest-hanging of fruits. It may be 100x on 1% of programs, or it may be 2x on 10%, and so on. Even a 2x overall program slowdown, while obviously suboptimal and unfortunate, doesn't make an initial version non-viable, it just limits the use cases where it's acceptable.
Have been thinking about this more than I probably should, and my position has changed slightly meanwhile. But let me explain:
I am genuinely interested in making both the Web and the Assembly in WebAssembly succeed, so I'm now advocating for WTF-16 as the primary and first-to-be-supported encoding scheme, since it backs (most) Web APIs, which are the primary target of this proposal as of today. At the same time, I recommend also supporting W/UTF-8 in a v1 (leaving the details to those more familiar with the affected languages), since efficient calls from, to, or between languages not originally developed for the web, like C/C++ and Rust, also seem important.
On the technical side, this guarantees that any language exchanging data with a Web API has to do at most one re-encoding, so it doesn't change the amount of work languages using W/UTF-8 (or any other foreign encoding) have to do anyway, while at best allowing a zero-copy operation (otherwise a memcpy). This is 2x better than the originally proposed initial encoding scheme, in that no language will have to re-encode twice, which would be absurd especially when interfacing with WebAssembly from JavaScript, given the scope of this proposal.
If a language or API requires well-formed UTF-16, it must perform a check on its side of the boundary. In the worst case, the receiver of the string has to perform the sanitization measures appropriate to make the string function according to the assumptions made by its implementation (different implementations may have different requirements), while in the best case, just the additional check is performed. Advantages of speccing WTF-16 instead of UTF-16 also include that only those APIs actually requiring well-formedness involve extra work, as well as that the specification does not have to deal with implicit trapping behavior by instead making uncommon special cases an ABI detail for language implementers to cover ("If string X is not well-formed, we'll do Y").
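As a rough sketch of what "perform a check on its side of the boundary" could look like for a receiver that requires well-formed UTF-16 (the function names are made up for illustration):

```ts
// Return true if the string contains no lone surrogates, i.e. it is valid
// UTF-16 and not just WTF-16.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u >= 0xd800 && u <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN past the end
      if (next >= 0xdc00 && next <= 0xdfff) { i++; } else { return false; }
    } else if (u >= 0xdc00 && u <= 0xdfff) {
      return false; // lone low surrogate
    }
  }
  return true;
}

// Sanitizing variant: replace every lone surrogate with U+FFFD.
const sanitizeUtf16 = (s: string) =>
  s.replace(/[\ud800-\udbff](?![\udc00-\udfff])|(?<![\ud800-\udbff])[\udc00-\udfff]/g, "\ufffd");
```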
Organizationally, it has been hinted that more information may be needed on whether calls between languages not using an encoding scheme directly compatible with Web APIs (for example, to aid shared-nothing linking between those) should be made equally fast in a v1, or whether this can wait until a v2. Hence I'd like to encourage everyone with an opinion on why W/UTF-8 should be supported as a second encoding scheme in a v1 to participate in this discussion, since delaying it to a v2 may involve certain call overhead until there is a v2.
Last but not least, I'd like to apologize that it has taken me so long to come to this conclusion, and the inconvenience this may have caused for those favoring exactly one initial encoding scheme, even though it has been hinted to me multiple times during the course of the conversation that an encoding scheme fitting actual Web APIs may be the better alternative, which totally makes sense to me now and I hereby support wholeheartedly for the above reasons.
Some more browser implementation background might be useful:
While JS strings/DOMStrings are spec'd to be a sequence of 16-bit code units (potentially ill-formed UTF-16, aka WTF-16), the representation of these strings in the browser is often not.
Most(/all?) JS engines attempt to represent a JS string with a 1-byte-char representation (SM uses latin-1). (JS strings also have a bunch more representations like: inline, lazy-concatenation-DAG, subset-of-bigger-string, atom..., and a single string can change representation as it is used.) Thus, when a JS string flows into wasm via a string interface type, (which, note, is always a copy operation) I think there should be no net benefit, and probably even a net loss (due to more bytes being touched), in having the destination linear memory encoding be WTF-16.
It's true that most places in Gecko still use nsString which stores char16_ts, but:
- as Bill said, there's often a copy between JS and Gecko anyways (b/c JS strings are so wacky) at the JS/browser boundary
- we've actually been moving toward UTF-8 in Gecko, and I expect this trend to continue (using WTF-8 when the value is a DOMString, not a USVString).
Thus, in the short-term, I don't anticipate any win with using a WTF-16 encoding in linear memory when passing a string to the browser and, over time as we optimize the Web IDL call path (informed and enabled by interface types), UTF-8 would enable the best final performance.
Now a separate use case that I haven't seen mentioned explicitly, but I expect is perhaps in people's minds is AssemblyScript, which, owing to its JS heritage, exposes WTF-16 code units. IIUC, AS is a pretty popular way to write wasm, so I think that's an important use case. But perhaps the discussion we could have first (probably best in a separate issue, maybe on the AS repo) is whether AS could change its internal representation of strings to be more like that of JS engines in a way that is UTF-8-compatible in the common case. (The hard case, of course is random-access, which devs expect to be O(1), but that can be addressed in various ways, as JS engines do.)
I think WTF-16 encoding doesn't fit with the conceptual model here at all. Interface type strings are supposed to be a sequence of Unicode characters, with the encoding abstracted away. Malformed UTF-16 such as that which WTF-16 would allow does not encode Unicode at all. If you allow WTF-16, then you have to allow WTF-8 on the other side as well, since otherwise it wouldn't be possible to losslessly transcode between the two. So this ends up not being a decision just about a single encoding, but affects all of them. I think if you want to pass around potentially malformed UTF-16 (such as Windows filenames), it might be a good idea to treat it as an array of u16s anyway, because the encoding is important to the data, and not something you want interface types to be abstracting away.
Thanks Luke, I always appreciate learning more about Gecko's implementation details. Now that I've explored this more from an existing specification perspective, your comment also reminds me of another important aspect that led to my conclusion:
It appears premature to put specific engines (or particular optimization strategies for that matter) over the specifications of the whole web that are backing actual Web APIs, similar to how putting specific languages or tools over these is. For instance, there may be such efforts in one engine, but perhaps not in all engines, these might succeed, or might not. All speculation regardless of expectations. Optimizations can change freely, while the web platform has specifications for the higher goal of openness and compatibility between its parts.
But perhaps the discussion we could have first (probably best in a separate issue, maybe on the AS repo) is whether AS could change its internal representation of strings to be more like that of JS engines in a way that is UTF-8-compatible in the common case. (The hard case, of course is random-access, which devs expect to be O(1), but that can be addressed in various ways, as JS engines do.)
That sounds like a fantastic discussion to have. Apart from O(1), one of our design goals is to ship tiny WebAssembly modules. Do you think both can be achieved? If you prefer, we can continue on the AS repo.
malformed UTF-16 (such as Windows filenames)
I believe JavaScript strings are a much better example in this context. Allow me to elaborate:
because the encoding is important to the data, and not something you want interface types to be abstracting away.
This is a very valid point, but in the case of Web APIs and JavaScript, using an encoding other than one compatible with the specified one (WTF-16 is the encoding there, after all) inevitably leads to the original data either not being represented exactly anymore or being incompatible with IT. As such, yeah, the encoding is indeed important to the data. Not sure which perspective weighs more.
supposed to be a sequence of Unicode characters, with the encoding abstracted away
Right, the explainer states "string is defined abstractly as a sequence of Unicode code points and does not imply a Unicode encoding scheme". Presuming that this means "exclusively perfectly valid Unicode code points", in this context UTF-16 would be the logical choice to at least solve the double-re-encoding problem, but it is still prone to the original data (of Web APIs) potentially being incompatible in some way. Now I'm not sure of the full implications this has. Might be hard to explain to a JS developer who'd just like to pass their string to their imported WebAssembly module, that they may not even be aware of being a WebAssembly module. I guess we either get compatibility with JS, or we don't?
But perhaps the discussion we could have first (probably best in a separate issue, maybe on the AS repo) is whether AS could change its internal representation of strings to be more like that of JS engines in a way that is UTF-8-compatible in the common case. (The hard case, of course is random-access, which devs expect to be O(1), but that can be addressed in various ways, as JS engines do.)
It's really hard to do efficiently. Dart, for example, still uses UTF-16 to encode code points.
The only possible option is to create another string type with its own interface and UTF-8 code points, which could also properly handle grapheme clusters, for example. But that's a different story.
@lukewagner (which, note, is always a copy operation)
Why would it copy? string is an abstract opaque type, and JS strings are immutable, so it shouldn't need to copy.
Of course the string -> linear-utf8 conversion operation will need to copy (and potentially re-encode), but merely converting from JS string to string shouldn't copy, right?
@dcodeIO Yes, I think it should be possible to achieve good code size; but happy to talk about that more in a separate AS issue. Regarding browser implementation details: I don't think what I described is unique to our impl (other than perhaps the goal of moving more internal representations to UTF-8, which was a minor point). My high-order point is that having wasm use WTF-16 to represent strings in linear memory doesn't really buy much, and may hurt compared to UTF-8.
@Pauan string is an abstract type, yes, but interface-typed values necessarily only live for the duration of the adapted call, so unless a string gets dropped without being used, it has to get copied into linear memory (or, in the future, GC memory, which will also require a copy since JS strings have a majorly different layout than anything you can describe with wasm GC types).
The only possible option is to create another string type with its own interface and UTF-8 code points, which could also properly handle grapheme clusters, for example. But that's a different story.
Dividing into grapheme clusters requires up-to-date Unicode tables and fails as soon as you run into an unknown character. I think it's definitely beyond the scope of something simply meant to abstract away the encoding.
(The hard case, of course is random-access, which devs expect to be O(1), but that can be addressed in various ways, as JS engines do.)
It's important to note that UTF-16 isn't O(1) random access either, but O(n) just like UTF-8. Any code that treats it like it is is fundamentally broken and relying on an assumption that hasn't been true for over two decades.
It's important to note that UTF-16 isn't O(1) random access either, but O(n) just like UTF-8. Any code that treats it like it is is fundamentally broken and relying on an assumption that hasn't been true for over two decades.
Oh, sorry, I was referring to the JS str[i] operation which is specifically defined to index code units, not code points. There is definitely an ambient assumption in JS around the web that str[i] is (amortized) O(1). (The "amortized" part is key in allowing the cornucopia of JS string internal representations.)
Oh, you mean JS engines don't always store it as UTF-16 internally? Do they use a single-byte encoding when they can?
Yep.
Ah, Latin-1! That's clever, because you can easily convert it to UTF-16 just by adding a zero every other byte.
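A minimal sketch of that inflation step, assuming a Latin-1 byte buffer as input:

```ts
// Latin-1's 256 code points coincide with U+0000..U+00FF, so widening is just
// zero-extending each byte to a 16-bit code unit (the "zero every other byte"
// in UTF-16LE terms).
function latin1ToUtf16(bytes: Uint8Array): Uint16Array {
  const units = new Uint16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    units[i] = bytes[i];
  }
  return units;
}
```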
@lukewagner So, if I'm understanding you correctly, you're saying that the conversion from JS string to string does not copy, but the conversion from string to linear memory does copy (which is expected). That was my understanding as well.
My high-order point is that having wasm use WTF-16 to represent strings in linear memory doesn't really buy much, and may hurt compared to UTF-8.
From an implementation perspective of contemporary engines this makes sense, yeah. What I'm mostly concerned about, however, is that Web API/JS strings cannot in all cases be converted to UTF-8 without either trapping or making an educated guess and modifying the data, which appears to be a greater problem, due to the sheer number of subtle issues this may introduce (code randomly traps, strings randomly don't compare equal to each other and whatnot), than adapting on the engine level. For instance, a JS developer might not necessarily be aware that one of their dependencies is a WebAssembly module, and their program might work fine most of the time, but crash or have undefined behavior occasionally. Now one could of course argue that nobody should ever create a potentially ill-formed UTF-16 string, but considering that there's a lot of code utilizing binary strings for example (window.atob anyone?), and not every developer is aware of the problem, would it be wise?
This gets more complicated the more that I think about it. There are even more possibilities for invalid UTF-8 than there are for UTF-16, and unlike UTF-16, invalid UTF-8 has no logical representation as a series of code points: it's simply meaningless in Unicode terms, to an even greater extent than invalid UTF-16 is meaningless. So, this means that languages such as C, which in practice store strings as UTF-8 but allow them to be invalid, would have the problem that their strings cannot be losslessly converted to UTF-16 or even WTF-16.
@dcodeIO That's a good point, and one I wasn't clear on above. Although the explainer is woefully out of date (I plan to update it once I finish some work on linking/module-imports that I think Interface Types should rebase onto), it might be worth reading the Export receiving string section, which proposes:
- a string is a sequence of Unicode code points, not Unicode scalar values, and thus can contain any potentially ill-formed UTF-16
- string lowering should allow several options (expressed as an instruction immediate) for how to handle surrogate characters, by one of: trap (what Rust and any other proper UTF-8 receiver would want), replacement character (what TextEncoder.encode does), or WTF-8.
It's not explicitly mentioned, but I think it'd be natural for string lifting to accept WTF-8.
And thinking about it again, if the most permissive options to lift/lower were WTF-8, then technically string wouldn't be any sequence of code points, but rather "sequences of code points that don't contain surrogate pairs", just like WTF-8. (Such a mess...)
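For illustration only (the names below are made up and not the proposal's instructions), the three lowering behaviours described above boil down to a per-code-point policy for when a lone surrogate is hit:

```ts
type SurrogatePolicy = "trap" | "replace" | "wtf8";

// Decide what a lone surrogate becomes when lowering a string into linear memory.
function lowerCodePoint(cp: number, policy: SurrogatePolicy): number {
  const isLoneSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  if (!isLoneSurrogate) return cp;
  if (policy === "trap") throw new RangeError("lone surrogate while lowering string");
  if (policy === "replace") return 0xfffd; // U+FFFD, what TextEncoder.encode produces
  return cp; // "wtf8": keep the surrogate and emit WTF-8 bytes for it
}
```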
@Serentty Right! And, even better, latin1 allows O(1) str[i] without inflating to UTF-16 :) That being said, an alternative I've been considering in the future is to switch out "latin1" for "7-bit ascii" which has both the previous properties and the additional property that 7-bit ascii is valid UTF-8. Thus a JS string could have 3 states: 7-bit, wtf-8 and two-byte-chars. The former is preferred, falling back to the middle when code points are out of range and only lazily inflating to the latter when a wtf-8 string is str[i]'d (which should be uncommon).
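Sketched as a type, those three states could look like this (purely illustrative, not how any engine actually spells it):

```ts
type JsStringRepr =
  | { kind: "ascii7";  bytes: Uint8Array }   // every byte < 0x80, hence also valid UTF-8
  | { kind: "wtf8";    bytes: Uint8Array }   // UTF-8 plus encoded lone surrogates
  | { kind: "twoByte"; units: Uint16Array }; // WTF-16 code units, O(1) str[i]
```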
Hello! I just opened an issue on the AssemblyScript side here: AssemblyScript/assemblyscript#1263
Hoping to move some of the AssemblyScript-specific implementation details there for now. Though, I would like to note that string encoding will be a problem for languages other than AssemblyScript for the MVP / down the line, even if they haven't been as vocal about it. Thus, I think it is worth keeping this issue open as well.
Looking forward to the discussion on that issue, as well as continuing the one here.
Thank you everyone!
One thing I see I missed in the comments above is just how many other languages use UTF-16. It's certainly not realistic to expect them all to switch their compilers/runtimes to use the implementation scheme I mentioned above, which means if we don't allow UTF-16, we'll end up with a bunch of unnecessary temporary copies.
I suppose this makes me a lot more in favor of including both WTF-8 and WTF-16. Practically speaking, engines can pretty easily include all 4 conversion combinations and the transcoding can be efficiently fused into the copy loop (which already has to validate WTF-8).
Forgive me if this has already been mentioned, but why does Wasm, a low-level ISA, care about something as high-level as character encodings? All strings should be seen as bit buffers stored in memory. The Web IDL receiver should interpret them as a documented encoding. Robust code should already check for well-formedness at potential external entry points. I'm personally not sure about having VM-implemented copying (which may be inefficient).
Wasm can't expect to allow every single encoding, which will always cause problems for some languages.
For example, I believe that nothing has been mentioned for UTF-32 so far, yet I've used it myself with Wasm, in systems programming, and might even want to use it with the interface types.
It seems more practical to support a single encoding and have all languages copy to be valid in that encoding.
@00ff0000red Interface types are an additional feature on top of Wasm which provides high level interop between Wasm and the host environment (such as JS).
Different Wasm languages have different encodings, and for performance reasons they want to avoid copying, so that means multiple encodings are needed. Since UTF-16 is extremely common in programming languages, it makes sense to support it, otherwise UTF-16 languages will be at a performance disadvantage.
I don't think the goal is to support every encoding, just the most common ones (which right now mostly means UTF-8 and UTF-16).
Just to add to that, the string interface type doesn't say anything about Unicode encodings, only that the result of decoding a string and the input to encoding a string is an abstract sequence of Unicode scalar values. Since the last round of discussions above, #122 significantly generalized the ability to decode/encode strings (which are now just (list char), and thus can use the general list lifting/lowering instructions), such that there should be no problem decoding or encoding UTF-16LE without intermediate copies.