GC story for strings?
dcodeIO opened this issue · 90 comments
As of the MVP document, strings can be expressed as either an (array i8) or an (array i16) per a language's string encoding, but with only one character at a time being accessible via array.get and no compatibility with JS strings, I'd imagine that the MVP will not be very useful beyond pure experimentation if a language wants to target GC primarily. Think comparing, hashing, substrings, etc.
The Post-MVP document for instance mentions bulk operations, in particular array.copy, which is a start but won't help with strings, and to my surprise it doesn't mention strings at all so far.
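For concreteness, here is roughly what even a trivial string operation looks like under the MVP as I read it (a sketch only; the names are mine and the syntax is approximated from the MVP document, so details may not match the final design):

```wat
(module
  (type $str8 (array (mut i8)))

  ;; byte-wise equality of two (array i8) "strings"; with no bulk
  ;; operations, every single element goes through array.get_u
  (func $str-eq (param $a (ref $str8)) (param $b (ref $str8)) (result i32)
    (local $i i32)
    (local $len i32)
    (local.set $len (array.len (local.get $a)))
    (if (i32.ne (local.get $len) (array.len (local.get $b)))
      (then (return (i32.const 0))))
    (block $done
      (loop $next
        (br_if $done (i32.ge_u (local.get $i) (local.get $len)))
        (if (i32.ne
              (array.get_u $str8 (local.get $a) (local.get $i))
              (array.get_u $str8 (local.get $b) (local.get $i)))
          (then (return (i32.const 0))))
        (local.set $i (i32.add (local.get $i) (i32.const 1)))
        (br $next)))
    (i32.const 1))
  (export "str_eq" (func $str-eq)))
```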
Hence I have questions:
- Are there any plans yet to address strings in particular?
- Is it reasonable to expect that Wasm strings and JS strings will be API-compatible eventually, so Web APIs can be called efficiently from Wasm and vice-versa?
Interface Types are supposed to handle the conversion from one string encoding to another
Yeah, this is somewhat a continuation of my infamous thread on UTF-16 over at interface types. Give it a read if you haven't yet, it's great. There is indeed some overlap between the two, and my impression is that interface types will be useful in the absence of GC, but we'll still need a better story in the presence of GC, most likely integrating with the parts of interface types that apply to both.
To give an example: A language may choose to not rely on linear memory at all, but use GC objects exclusively, including to represent strings, and there may be the expectation that these strings can be passed to or received from Web APIs, and that dealing with such strings internally to a module is reasonably efficient. Neither interface types nor GC, in their current form, are very useful there, unfortunately.
The GC proposal is not responsible for this kind of interlanguage interop. Different languages represent the same high-level concepts differently in their low-level representations for a variety of reasons. For example, OO languages need strings to have a v-table with references to an i-table and to reflection information, and each of those would need to be organized according to the specific OO language's APIs. Some languages/libraries represent strings as trees. Then of course there are the issues with encodings. Given the diversity, it seems difficult for there to be a concept like "wasm string" that different systems could reasonably use for interop without copying or wrapping. This is why the Requirements document explicitly lists this kind of interop as a non-requirement.
The requirements document also states the opposite in critical success factors, and I am very surprised to see this argument being made. If we don't have a good story for strings in Wasm GC, and given that strings are by far the most common higher-level data structure in most programming languages (and very commonly consumed and produced by Web APIs), I'd go as far as to say that we are about to make a critical mistake endangering the success of Wasm as part of the Web ecosystem. Not tackling this will significantly undermine the viability of AssemblyScript in favor of systems languages compiling to Wasm, for example, and as such some might see this as an inexcusable standards body decision. That's exactly what I was concerned about in my comments on the requirements document.
There is a middle ground. WebAssembly (and specifically GC) can provide the infrastructure for interop without baking specific interop designs into its core design. So someone can design a specific way to represent strings using GC/wasm types, and then various languages can choose to use that particular representation to facilitate interop. This includes JS, though you'll have to get the browsers to agree to do so. Or someone can design a module interface (with an abstract exported type) for string values and operations that wasm modules can link to and which each browser can implement according to how it implements JS strings. (For this to be efficient, you'll want a better compilation model though, which would be useful for a variety of reasons.)
The high-level idea is that interlanguage interop is an ecosystem problem, not a low-level problem, and it is best served in WebAssembly by developing low-level mechanisms that can support various ecosystem solutions.
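For concreteness, a sketch of the first option (illustrative only; the $convstr convention and the module/export names are made up here, not anything the group has proposed):

```wat
;; Module A (producer): follows an ecosystem convention, not a spec rule,
;; that strings are immutable i16 arrays.
(module
  (type $convstr (array i16))
  (func (export "get_name") (result (ref $convstr))
    ;; a one-element string, built inline for illustration
    (array.new $convstr (i32.const 0x41) (i32.const 1))))

;; Module B (consumer): declares the structurally identical type and can
;; therefore accept Module A's strings without copying or wrapping.
(module
  (type $convstr (array i16))
  (func $get-name (import "producer" "get_name") (result (ref $convstr))))
```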
To me the middle ground is to ensure good interoperability with JavaScript as a critical success factor, not necessarily between arbitrary languages, and I was under the impression that this is what browser makers and the overwhelming majority of the group want as well. The arguments and suggestions you are bringing up might be worth exploring, of course, who knows what we'll get out of it, but so far I don't see a clear merit in making this any more complicated than it needs to be.
Concur with Ross here. The kind of interoperability you seem to be asking for is already impossible with JS. Most engines, including V8, use a multitude of internal representations for JS strings. The differences are 'papered over' at the source level.
I am completely stumped by the responses and will have to think about it for a bit. Might well be that betting on Wasm GC and trusting you with the changes made to the requirements document instead of following up on my points was a mistake.
@dcodeIO It would be helpful if you would more concretely illustrate the use cases you are concerned about. You mention wanting something like array.copy for strings and wanting to interop with JS strings, but these two desires seem to be contradictory given that JS strings are not represented in real engines as arrays. (Here is a summary I found of V8 string representation.) I feel like your needs might be met by having a host-implemented string library (as I suggested above) with a function for constructing a new string from the contents of a given mutable array (it has to be copied because JS strings must be immutable), and possibly another for constructing a new string from an immutable array. But without concrete illustrations of those needs, I can only guess.
There seem to be two needs at play here: 1) String manipulation in the GC proposal needs to be competitive with string manipulation in linear memory and 2) JS interop is generally important, and interop with JS strings is particularly important.
On the first point, I agree that the current MVP seems to leave a lot on the table since it can only access one array element at a time. It seems that the best solution would be to have instructions akin to the variety of MVP load and store instructions that operate on i8 arrays (or arbitrary integer/number arrays?) rather than linear memories.
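To illustrate the gap (a hedged sketch; the type and function names are mine), here is what reading four packed bytes at once costs today with per-element access, compared to what a single load instruction does on linear memory:

```wat
(module
  (type $str8 (array (mut i8)))

  ;; today's equivalent of one hypothetical bulk load: assembling four i8
  ;; elements into a little-endian i32 with four separate array.get_u calls
  (func (export "read_word") (param $s (ref $str8)) (param $i i32) (result i32)
    (i32.or
      (i32.or
        (array.get_u $str8 (local.get $s) (local.get $i))
        (i32.shl
          (array.get_u $str8 (local.get $s)
            (i32.add (local.get $i) (i32.const 1)))
          (i32.const 8)))
      (i32.or
        (i32.shl
          (array.get_u $str8 (local.get $s)
            (i32.add (local.get $i) (i32.const 2)))
          (i32.const 16))
        (i32.shl
          (array.get_u $str8 (local.get $s)
            (i32.add (local.get $i) (i32.const 3)))
          (i32.const 24))))))
```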
On the second point, the complexity and diversity of string representations in JS engines make it infeasible and undesirable to expose the underlying structure directly to WebAssembly, as @RossTate pointed out. That leaves two possible paths for first-class JS string interop: A) a "wasm string" type is introduced with instructions corresponding to all the common string operations that can be an abstraction directly on top of an engine's native JS string representation, or B) the JS API defines a coercion from JS strings to GC types and vice versa to make passing strings into and out of WebAssembly simpler.
Both of these solutions are problematic. (A) does not fit with the goal of having the GC proposal represent only low-level representation types, and as Ross mentioned, most languages would not be able to use a native "wasm string" as their string type anyway. There's also practically very little difference between (A) and just importing a string type and functions on it from the environment, which will be possible with type imports. (B) is problematic because there is no good way to choose which GC representation of strings to bless. Choosing any particular "blessed" representation would inherently favor some source languages over others, so it would be better to produce a more general solution like interface types that can work for all languages.
most languages would not be able to use a native "wasm string" as their string type anyway
Actually, I think a first-class immutable string primitive that "papers over" the underlying byte encoding could still help a lot, even for languages that have a different native string representation.
For example, while C# might not be able to adopt "wasm string" as the underlying representation of System.String, C# could expose this new type as 'System.Interop.WebAssembly.WasmString' and let C# developers choose if they want to use the new zero-copy WasmString or pay the cost of transcoding to get a classic C# string.
I think option (A) is worth pursuing further.
One specific point: We are discussing strings here on the GC proposal, but the issues are separable. It's very possible strings do not fit in with the goals of the GC proposal (as @tlively just mentioned), but also that @dcodeIO 's requests are valid and should be addressed by adding a string type to wasm in another proposal.
My perspective is that we should add a string type to wasm, perhaps doing that in a separate proposal all by itself. The benefit of a string type would be similar to string types in other multi-language VMs (CLR, JVM, and for that matter JS) - they allow quick and easy passing of string data between languages. C won't use it (just like it won't on the CLR), but I understand @dcodeIO to be asking for AssemblyScript to be able to interop with JavaScript the way that TypeScript and many other languages that compile to JS can. That's a good goal!
Great points above. And I'm realizing (thanks to @tlively's post) that I overlooked a concern raised in the OP about arrays. Sorry @dcodeIO!
On arrays, one thought that's come up a few times is having linear memories within GC objects. This came up in at least #94 / #109, and in WebAssembly/interface-types#68 there is some discussion of slices, where I think it would be nice from an interop perspective for a slice to be either a view into a linear memory or a view into non-referential data within a GC object.
I like the idea of a separate proposal for strings. In Java or similar languages you can hold the "native" string inside the language's string type, like the native array inside an array object.
Currently a string from JavaScript can only be held as an externref. I cannot check equality, etc. It is like a black box.
And with interface types, a string must be converted on every call. This will never be efficient.
I expect even a separate "add a string" proposal is fraught with peril because there are so many fundamentally-incompatible low-level design choices with strings that will ultimately force the proposal to bias towards one subset of languages at the expense of others.
One alternative route for JS that I've been imagining is:
- the wasm module preimports a type that it uses as the host's string type, as well as function preimports for each string operation which use the imported string type in their signatures
- the wasm module gets preinstantiated with some WebAssembly.StringType singleton (defined by the JS API to mean "the JS string type") and JS built-ins for each string operation (adding new ones for operations that don't already exist)
- because preimports are used, the wasm engine is encouraged to statically specialize the machine code to the specific imported values, achieving essentially the same performance as core wasm types/instructions (to a knowing engine); a sketch follows this list
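A rough sketch of what such a module could look like (the type-import syntax is borrowed from the Type Imports proposal and is illustrative only; the "host" module and operation names are made up):

```wat
(module
  ;; pre-imported abstract string type; on the Web the host would supply
  ;; its JS string type here (syntax per the Type Imports proposal sketch)
  (import "host" "string" (type $str))

  ;; pre-imported operations on that type; on the Web these could be
  ;; backed by JS built-ins
  (func $length (import "host" "string.length")
    (param (ref $str)) (result i32))
  (func $char-at (import "host" "string.charCodeAt")
    (param (ref $str) i32) (result i32))
  (func $concat (import "host" "string.concat")
    (param (ref $str) (ref $str)) (result (ref $str)))

  ;; module code can now traffic in host strings without copies; with
  ;; preimports the engine can compile these calls as if built-in
  (func (export "first_code_unit") (param $s (ref $str)) (result i32)
    (call $char-at (local.get $s) (i32.const 0))))
```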
The advantage of this approach is that, when targeting a Web embedding, a wasm module can use operations that precisely match JS strings while a module targeting, say, a future Go embedding could import a completely different set of Go-string-appropriate operations.
Ultimately, if your aim is to write a single module that can be used across many different hosts, I think you're going to end up with an impedance mismatch unless you make a copy at the wasm/host boundary into and out of the wasm module's own private string representation. But if you're targeting only a single host and the source language is intentionally embracing the host's string type, then the preimport approach seems pretty good.
I expect even a separate "add a string" proposal is fraught with peril because there are so many fundamentally-incompatible low-level design choices with strings that will ultimately force the proposal to bias towards one subset of languages at the expense of others.
I think picking a fixed string type is inevitable, and yes that will bias to a subset of languages, but I disagree with the assumption that that's a downside!
JS, the JVM, and the CLR all have a single string type, and all show how successful that can be in letting various languages pass data back and forth easily and efficiently. If we think there's a better example, or that there's something wrong with their approach, I'd be curious to hear details about that. Otherwise I think we should learn from their success - as those VM ecosystems show, some languages will choose to use the blessed string because it's compact, convenient, and efficient, that works with a lot of other code. That's good for the ecosystem, I think.
Picking one string type doesn't prevent other languages from not using it - there is no downside to them. And we'll still want Interface Types for the case where a language wants to do strings its way and convert at the boundary. Also I agree the preimport idea is interesting too, and can help with perf issues.
JS, the JVM, and the CLR all have a single string type, and all show how successful that can be in letting various languages pass data back and forth easily and efficiently. If we think there's a better example, or that there's something wrong with their approach, I'd be curious to hear details about that. Otherwise I think we should learn from their success - as those VM ecosystems show, some languages will choose to use the blessed string because it's compact, convenient, and efficient, that works with a lot of other code. That's good for the ecosystem, I think.
One obvious downside to fixing strings to something like JS strings is that they effectively have to be encoded as UTF-16 (at least for non-ASCII strings) since that's how JS indexes strings. There's no efficient way to hide that from the user without somewhat subtle perf cliffs (e.g. re-encoding the string in a different format).
I think picking a fixed string type is inevitable, and yes that will bias to a subset of languages, but I disagree with the assumption that that's a downside!
Biasing toward a subset of source languages is a clear downside because it directly contradicts the design goals we publish in our specification. I agree that we should learn from the experiences of the JVM, the CLR, and JS, but only when those lessons are compatible with WebAssembly's firmly established design goals.
I agree that would be a downside there. A decision here would need to consider lots of factors (most of which I personally don't know enough about atm), including that.
In general not every new feature will help every language or not help them equally. Things like Tables, exceptions, SIMD, GC, etc. may not end up used by every language, either because they choose not to or they can't - C can't use exceptions, for example, but it isn't "harmed" by us adding exceptions. Some new features may "bias" towards the languages they make sense for, but that's ok, because wasm as a whole still aims to support all languages.
To be clear, I'm not saying "let's do a string type and stop thinking about languages that won't use it!" - we'll still want Interface Types and other things in this area, as I already said.
I dug into the CLI spec and found that string is treated essentially as an abstract type (i.e. a (pre)imported type) except for one instruction: ldstr $index specifies an index into the program's string metadata and returns the corresponding string value.
@lukewagner How would a preimport change the behavior of the string type in WebAssembly? It can give some performance improvements, but it would still be of type externref, so the usage of ref.eq or any type check would not be possible.
You can preimport the (abstract) string type, and that preimport can express requirements on that type such as "must support equality".
I have no idea how this should be possible, even after looking into Type Imports and Exports. But if it is possible, then such a type would work like an internal string type.
Yes, there are various discussions about the shortcomings of that proposal. The "pre"-imports @lukewagner and I mentioned above provide compile-time imports, enabling the importing code to be specialized to the imported types and values/functions. One application of preimports we hope for is things like this, where we can enable the ecosystem to develop a string type that performs roughly as well as a baked-in string type, without having to add the type and operations to the core spec of WebAssembly itself.
The more I think about the pre-import idea, the more I like it as a solution here. Would it allow something like the following (if not, maybe I'm misunderstanding something)?
- wasm modules import an abstract string type, that is known to support "get char at index", "get length", and so forth basic operations.
- A different pre-imported string type could lead to different behavior on the same wasm file (for example, "what is a valid character" differs between different string types).
- On the Web, the natural thing to pre-import is the JS string type. Correspondingly, for wasm on the JVM and CLR, the natural thing would be their builtin string type. In all these cases the platform could provide the type to pre-import (so it's simple for modules to use, and not downloaded each time, etc.).
- Pre-importing the natural string type on the Web, JVM, and CLR could lead to different behavior. Or, if that is not what a specific project wants, it could implement a specific string type in wasm and pre-import that, in which case the behavior would be identical everywhere.
- In all the above cases, string operations are compiled to be fast.
Does that make sense @lukewagner @RossTate ?
And @dcodeIO , if the above is correct, would it give AssemblyScript all it needs?
I'm not sure I fully grok the type pre-imports. If I understand correctly, a wasm module wanting to use JS strings would need to:
- import the JS "string" primitive type (How do you refer to it on the JS end? Note there is no prototype for strings because they are not objects; String.prototype is for a wrapper class, not the primitive type)
- import native functions for string operations (Would these be the methods from String.prototype? If so, how would you call them with no way to pass a this parameter from WebAssembly's call opcodes?)
It's unclear to me how to refer to the string type from JS, and how to call native string operations without wrapping them in JavaScript functions that shuffle an argument into this.
The need to support "get char at index" is a big part of the concern with any proposal in this space. If "char" in this case is assumed to mean "UTF-16 code unit", and "get char at index" is expected to do O(1) random access, it doesn't leave much room for implementations to have different kinds of strings in practice, regardless of whether the string type is abstract, imported, pre-imported, or anything else.
First of all, thanks for all the thorough comments pointing out further aspects and seeking a solution to the problem. Appreciate it!
The type import idea is interesting in a way, in that it resembles what one would do today (just not as efficiently) by importing wrappers around String, and that taking it one step further leads to pre-imports, and taking it two steps further again summons that ominous universal string type on the horizon that users will ask for eventually. For instance, importing an environment-specific string is in fact one solution to the problem, but compiler authors will eventually crave a more universal way (with browsers being the common denominator) so they don't have to maintain various standard libraries, and users will become tired of duplicating and slightly modifying their entire code base to account for different environments.
Now I know about the challenges involved in creating something universal, yet my view on this has remained surprisingly consistent since the interface types UTF-16 discussion, in that I still reckon that the most reasonable thing to do is to bless the family of pretty much ubiquitous UTF encodings, abstracted as a universal Wasm string type, reusing the infrastructure browsers already have for expressing strings in various representations. Or, of course, something slightly more complicated ultimately achieving the same.
P.S: On a more general note: Might be that I am in a somewhat special spot due to AS's proximity to JS-land, seeing so many Web(Assembly|)Devs being left behind disappointed who initially were super excited about Wasm. Sure, there is still a constant stream of new people entering Wasm-land, but that won't continue forever and reality will hit us hard. Certainly, i31ref won't save us, but good interop can. It makes me sad to see that we are failing over something that really shouldn't be anything a bunch of the smartest theorists and programmers can't solve. Feeling tempted to close this paragraph with a famous quote by Brendan Eich, but won't, because it may be invalidated if we only do what the quote says so we can eventually add a conditional or even replace a word.
A native string type (and even a host provided string module) sound like a really bad idea to me.
It risks this type being used pervasively which brings more code inside the Wasm module(s) in contact with it, which means more code locations needing to deal directly with conversions, which brings inefficiency, code bloat, and most importantly, since these unknown string representations can generate arbitrary errors, more code that needs to be error aware, and potentially has to deal with errors it wasn't tested with.
The advantage of arrays of i8 or i16 is exactly that they're binary; they can't error while in that format. In fact, they may contain actual non-text binary data, which is often useful. For an unknown string representation you have no guarantee what "bits" it is going to preserve. This will bring non-determinism into Wasm.
When it comes to strings, it's important to realize that the bulk of code just stores them or passes them on, and only a small amount of code needs to care about the actual text contained in it, and understanding unicode details. You thus want to optimize for a code structure where the bulk of code does NOT need to deal with encoding errors of implementation formats. A built-in implementation dependent string will do the opposite.
Wasm GC should not be an extension of JS, it is a low-level building block for a wide variety of languages that can choose their own representations. Let's not burden them with JS implementation details.
It's unclear to me how to refer to the string type from JS
Yes, not clear to me either. But I assume this would be something like providing the pre-import in JS from WebAssembly.JSString, that is, the wasm JS API would make it easy to do this.
If "char" in this case is assumed to mean "UTF-16 code unit", and "get char at index" is expected to do O(1) random access,
My hope (which @lukewagner can correct) is that that wouldn't be a problem with the pre-import approach. The imported type would have a "get char" but it would be abstract and not make any claims as to speed of access. One could provide a pre-import that is O(1) or one that has O(N) or anything else (each resulting in a different program, effectively).
compiler authors will eventually crave a more universal way (with browsers being the common denominator)
I suspect you are right, there is a vacuum here that will want to be filled with a string type. But if pre-imports can work the way I outlined, I think it can work out without speccing such a type. On the web we'll pre-import JS strings, and if someone wants to run their wasm off the web too, and have it run the exact same way, they could link and pre-import a wasm module that implements something equivalent to the JS string type (that's why I asked about this before). It's possible that that will lead - in an organic way, without any spec work - to a JS string-like type being popular in the whole wasm ecosystem. Or maybe wasm on the server (say in the CLR or JVM) will prefer another type, who knows.
A string type that can't guarantee basic access methods would probably not perform well all around; for instance code that frequently uses indexing into a string assuming it will be O(1) will perform badly if handed a string type that performs an O(n) scan on every access. I don't understand how a generic string type would be useful except for data interchange at API boundaries.
On the web we'll pre-import JS strings, and if someone wants to run their wasm off the web too, and have it run the exact same way, they could link and pre-import a wasm module that implements something equivalent to the JS string type
It's certainly an interesting idea that I am not opposed to, as it's already a significant improvement; AS could add a flag --use String=JSString similar to what we do with Math (in fact we could even do something like this already today, with downsides), and eventually repurpose the existing Wasm-native string implementation we have to provide the Wasm module that implements JSString for everyone to use off the web. It won't have any meaningful interop off the web ofc, but I don't know yet how much of a problem that would be. Might or might not make AS less viable in Wasmtime, on the Edge or in Crypto, and that might or might not become a major bummer for our supporters. In general I think that fragmentation like this is bad, and that standards are only good if they reduce fragmentation instead of promoting it.
I don't understand how a generic string type would be useful except for data interchange at API boundaries.
Just for completeness, my initial mental model was something like
- string is a new heap type
- stringref is a new reference type, stringref == (ref string)
- stringref is a subtype of eqref, string <: eq (potentially anyref if engines can't guarantee identity)
- string.new <enc>: [i32, i32] -> [stringref]
- string.get <enc>: [stringref, i32] -> [i32] (traps if oob)
- string.len <enc>: [stringref] -> [i32]
- string.lower <enc>: [stringref, i32] -> []
- string.eq, ... ?
where the encoding is known statically as an immediate, so an engine can make proper decisions on what immutable representation to use where, and when to add another representation to the pool. For MVP encodings I imagined -0x1: WTF-8 and -0x2: WTF-16 (keeping positive values for custom encodings using encoder/decoder functions as arguments), which due to their similarity might both be created right away upon construction. Might well be that just one representation is necessary in certain environments, or modules are limited to exactly one so it becomes a matter of adding a representation at the boundary. Has challenges, of course, for instance when to use replacement characters or trap and whatnot when proper UTF comes into play. Not sure how much sense it makes to advocate for good interop any further at this point, though.
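Put together, usage might look something like this (all instructions hypothetical per the mental model above; the textual immediate wtf16 stands in for the -0x2 encoding byte):

```wat
(module
  (memory 1)
  ;; hypothetical: lift WTF-16 data from linear memory into an
  ;; opaque stringref, per the string.new sketch above
  (func (export "make") (param $ptr i32) (param $len i32) (result stringref)
    (string.new wtf16 (local.get $ptr) (local.get $len)))

  (func (export "eq") (param $a stringref) (param $b stringref) (result i32)
    ;; representation-independent equality, per the string.eq sketch
    (string.eq (local.get $a) (local.get $b))))
```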
@kripken Yes, what you said above mostly matches my understanding. The only nuance I'd add is that, as @sunfishcode and others said above: if you try to implement "JS-flavored" string type+function imports with a non-JS host string type, it may result in bad performance, semantic inconsistencies, or both. That's why I said this solution was ideal for cases where a particular project only cares about one kind of host, say, the Web.
That being said, from my rough understanding of Java and C#, they underwent the same UCS2-to-WTF16 transition as JavaScript that left their strings allowing invalid UTF-16 sequences and their indexing operators returning two-byte-code-units instead of Unicode scalar values. So JS/CLR/JVM might (unless I've missed a detail, which is likely) all be able to share a compatible/efficient set of strings operations.
Sounds good, thanks @lukewagner !
I understand the possible worry, but I think it's ok given your points on strings in Java and C#, and also as mentioned earlier I think it would be fine to end up with a JS-like wasm string module that ends up popular in the wasm ecosystem - there would be no point in trying to prevent such an outcome, and also no point in trying to predict it.
@dcodeIO I'd bet that wouldn't be too bad for AS off the web given the small size of that module.
@kripken (from an earlier post above):
In general not every new feature will help every language or not help them equally.
This is true, but we can distinguish between features used as implementation details, and features that shape cross-language communication. A language can use e.g. Tables, or not use them, and it shouldn't significantly affect that language's ability to communicate with other languages in wasm (assuming interface types takes the place of passing around i32 "function pointers").
If a JS-style string becomes popular as a way to communicate between languages, and the ecosystem of APIs grows up using that string type, it could bias the ecosystem towards WTF-16 languages, and away from other languages. This is potentially a reason one might wish to prevent a JS-like string from becoming popular as a data interchange format in wasm.
Fair point that some features are more involved in language communication.
If you are arguing against adding a WTF-16 string to core wasm, then I agree with you as mentioned earlier - @lukewagner 's pre-import idea is superior.
If you are arguing against the pre-import idea, then I don't understand your point - pre-importing is a generic mechanism, by itself it favors no particular types (string or otherwise)?
(If you are just concerned - separately from the discussion of possible spec additions - about WTF-16 becoming popular in the wasm ecosystem, then I'm not sure I agree with you or not, but regardless, what can be done?)
Import or pre-import, if the usage is O(1) random access to UTF-16 code units, it favors particular string types.
On lessons from other VMs, JS baked in assumptions about Unicode in the 1990's and is now stuck with error-prone UCS-2 vs. UTF-16 subtleties and unpaired surrogates, and ironically out of step with the Web itself, where today over 95% of pages are UTF-8.
What can be done is interface types. Let's work together to make interface types the best way to pass strings from wasm to Web APIs and between source languages in wasm.
What can be done is interface types.
Can Interface Types provide what AssemblyScript is asking for here?
Of specific concern to me is the ability to receive a string from outside, operate on it, and send it back, all without performing a conversion (the goal should be close to the efficiency of how compile-to-JS languages use strings). My understanding is Interface Types would require a conversion, but I could be wrong!
(If it would, then I think it solves a different problem, but also an important one, that I agree is worth working on.)
I don't even know anymore what is going on here when I see claims that all languages using WTF-16 today, like JavaScript, Java and C#, are out of step with the web itself, solely based on the statement that "95% of (html) pages are UTF-8". Reads as if these languages were broken, while in reality all they did when transitioning away from UCS-2 was ensure backwards compatibility, to not invalidate all code ever written in them over night. WTF-16 became reality out of sheer necessity, and to me the existence of the WTF-8 document indicates that something went wrong in the standards process and cannot simply be ignored. And btw, "over 99% of code running on the web is built against a WTF-16 API" and would have to be rewritten if WTF-16 was phased out. I really do not understand what is going on in this thread, given that the W3C's mission and principles are punched in the face like that. Could as well suggest that WebAssembly should be allowed to break backwards compatibility, because that's what we do here. This is completely beyond me, and again mentioning interface types, which is really just for the boundary, doesn't make it any better. Like, seriously, we are talking about interoperability between JavaScript, the first (and very successful) language of the web (and the primary interop target of WebAssembly), and WebAssembly, something we all genuinely care about, here. Is there really nobody else around in the group seeing WebAssembly's specification process derailing entirely? What has happened to the Open Web Platform, and its vision of One Web?
@dcodeIO Wasm so far has been very carefully designed to not make assumptions about the host, such as the presence of JS or some other language. It is intended to be a substrate for as many languages as possible, and many kinds of hosts/engines. Some of us would like to keep it that way, to not hurt Wasm's future applications.
That, and it would be good to not perpetuate the spectacular broken-ness that is 16-bit unicode to further systems. If Wasm would have such a type as its "default" that would guarantee we'd be stuck suffering its issues for a long time.
Interface Types are not perfect in all scenarios, but they sure are the least biased solution, allow many scenarios to be efficient, and importantly they mean we will not be stuck with anything.
Some of us would like to keep it that way, to not hurt Wasm's future applications.
Yeah, it is indeed my impression that there are essentially two groups here, and I am apparently on the losing side, but shouldn't be, because the goals you state are in contrast to what I signed up for when joining a W3C community group. In fact I believe that if more Web people knew what is going on here, they'd be as mad (or sad, depends) as I am. Unfortunately there aren't more Web people around here, and I also haven't used any of my reach to drive them here, because I am really only interested in solving a problem, not in making enemies. Look, I'm genuinely interested in making the Web, and WebAssembly, the best it can be, and I believe everyone else here is as well, but there is a conflict here so fundamental that I cannot just be silent about it, because I think that WebAssembly is missing the opportunity of its lifetime by not having good interoperability with JavaScript out of the box. To me this weighs more than the alternatives in terms of not hurting Wasm's future. Just look at the requirements document and what I achieved by advocating for good interoperability. Nothing.
Also, if you look closely, I always put the 8 first (well, to be fair, I once switched from the 8 first to the 16 first to make a point), on the interface types issue and here (I strongly believe that C and Rust are great and important, and I do not hate them, contrary to popular belief), and I am not arguing about which encoding is better. I am just outlining the reality we are living in, and why it came to be, and want this issue to be addressed in an appropriate way that is not just "UTF-8 everywhere" purism. Instead there are comments like some of the above that I experience as making a fool of me for just standing up for something that should be a given. This is not right, and some might even see this as systematic bullying of an independent developer who just so happened to slide into a W3C community group. That's the tone sometimes swinging with my posts, because that's what I feel but don't express, since I'm trying hard to avoid that and stay professional, and it's definitely nothing I enjoy doing.
Still, the bottom line is that I want good interoperability with JavaScript out of the box, and if you have better ideas to achieve this than interface types (which are also important, but currently not perfect in all scenarios, as you say), or ways to improve them meaningfully (like bridging them to GC objects or whatnot; I am not opposed to that either), then I am happy to listen. If you do not, then all we are achieving here is driving the discussion towards becoming less productive, as it already has, and I do not like that either.
I feel like there's been some talking past each other about creating a single "blessed" or "preferred" string type versus making sure it's possible to efficiently and cleanly use the existing WTF-16 host string type on JS hosts to implement WTF-16 in languages that use such strings and wish to interoperate cleanly with the host.
What is needed there isn't to specify a single preferred string format, but to specify enough of the JS host API for GC and types that it's possible to import the string type and operations such as concat and substring in a way callable from Wasm without an intervening JavaScript function.
When running on non-JS hosts a custom implementation of WTF-16 strings could be substituted at no cost of functionality, just having to ship or link the module.
Personally I would consider it a strong failure of Wasm to not consider this use case and treat it well, even if it is specific to the JS host, because the mechanisms are mostly universal and it would benefit key use cases (web interop).
When running on non-JS hosts a custom implementation of WTF-16 strings could be substituted at no cost of functionality, just having to ship or link the module.
I do consider the alternative suggested by @lukewagner and @kripken a possible fallback solution, but it also puts languages designed for the Web as of today at a disadvantage, for designing for the Web, on other platforms. Like, C and Rust do not need that on the Web. For instance, shipping another module that a non-web host doesn't know what to do with puts these languages at a disadvantage due to hurting developer experience and, by that, acceptance of that language, code size (one could argue that it is small, but one day there are not just strings but also arrays, maps, sets), and, again, interop with the other host, while the inverse use case of running C or Rust on the Web is already covered perfectly well with its own proposal, namely interface types. Furthermore, interface types are repeatedly brought up as the solution for languages designed for the Web, but they aren't. There already is a bias here, and that has to be addressed (but perhaps not any further on this issue). That being said, before I leave with nothing I'll certainly consider pre-imports because it isn't only bad, but then again it is also only necessary because of bias against the Web Platform (sorry, now I'll stop, unless there are more comments putting this in doubt).
WTF8 - "best size" (but only for English without emojii)
WTF16 - "balanced" (and best size for non-ASCII and emojii)
UTF32 - "best performance"
SomeNewUTF format from future which deprecate all of them.
Why not support any (all) of them?
I recommend reading this article about abstracting unicode strings:
https://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings
- WebAssembly abstracts over the CPU
- WebAssembly abstracts over memory
- WebAssembly abstracts over the environment / operating system (WASI)
- WebAssembly abstracts over cross-language & host communication (interface types)

Could WebAssembly abstract over string encoding? If all of the above is possible, why is it impossible?
For example, Rust already has abstractions for FFI called OsString and OsStr. Python 3 tries to solve the same problem, but differently.
it is also only necessary because of bias against the Web Platform
This is where I get lost. I see no bias against the Web Platform in the pre-imports solution. I do see how it is not biased towards the Web Platform, but that is different: it is not biased towards any platform. And, from what I can tell, pre-imports serve the Web Platform just as well as a built-in JS-oriented string type would.
Can you help me fill in what I am missing?
+1 for @Brion 's point - I think the goals here do not conflict. Interface Types will help with language interop when the string types do not match, while pre-importing the system string type will help with languages that want to actually use the system string type. The latter is critical on the Web, and I also would consider it a failure if wasm didn't achieve that.
I also agree with @aardappel that we want to keep core wasm generic. Core wasm can have generic pre-importing, while the wasm-JS spec can add a special way to access the JS string type. So again there is no conflict.
The goals of efficient Web interop and of keeping wasm generic might conflict, and a lot of the debate here seems to be about that. But I think @lukewagner 's pre-import idea solves things so that there is no conflict in practice!
Also, pre-imports are a generic idea that should happen anyhow regardless of strings. Whether it will be used, among other things, for strings, is not even a question for the core wasm spec. Similarly, work on Interface Types is already in progress. So it's not like we need to pick between these two ideas - we're going to have both, I think.
code size (one could argue that it is small, but one day there are not just strings but also arrays, maps, sets),
I hope this is less of a problem in those cases. E.g. the GC proposal can define arrays, and the wasm-JS proposal could allow JS to access a wasm array's elements with high efficiency. So a language like AssemblyScript could have really nice interop with the outside. (Unlike with strings, I wouldn't expect the opposite direction of being able to operate on a JS object inside wasm with high efficiency, which is why the cases differ, unless I'm missing something.)
I'm not sure I understand your points @dcodeIO about the comparison to C and Rust. To me it seems clear a language like AssemblyScript using a pre-imported web string type would be at a clear advantage compared to them - because it can avoid copies, and because it doesn't need to ship basic string operations like strcpy etc. (Of course C and Rust have their own advantages! But they were not designed for running on the web.)
I can't tell for sure of course if there will remain conflicts beyond my points about code size (which matters for example in edge computing and blockchain) and the general developer experience of having to deploy like 10 modules eventually that one cannot DCE anymore just to run some simple code. That will work, but won't be great, and people will think twice whether something like AS is worth it to use on the edge or in blockchain.
I see other concrete problems as well, in that for example you seem to assume that GC's arrays only solve problems, but in fact they create new ones in that JS arrays are dynamically resizable. Hence, a JS-like compiler will try to, let's say, double the size of a Wasm array for efficiency reasons if it is pushed to multiple times, to reserve some space upfront, leaving one with a fixed-size array that is too large to just pass to JS as-is, and things like that. There might be solutions to these problems, of course, but as I see it the array type as it stands isn't really designed for interop, so isn't a great example. These concerns are often subtle and easy to overlook.
Then of course there are strings, which, once they are a pre-imported JSString, cannot just be used to call an external API that wants, let's say, WTF-8, but need to be lowered to memory in order to be passed through interface types using an adapter. However, some languages might not even want to use linear memory, or want to reserve it for user code; after all, they opted for Wasm GC. Not sure if interface types can help there. It might, eventually, of course, I don't know. Might already be too late for languages like AS once it does.
I'm not sure I understand your points @dcodeIO about the comparison to C and Rust
That's referring to the more general point I was making in that the Web Platform has been put into third place after C and Rust so far, not so much technical.
Can you help me fill in what I am missing?
See for example my new points above. There is certainly more that I haven't thought of yet, and there are already too many "mights" for my taste, since I expect that the group will not want to work on Web Platform issues more than it has so far, since that's not what people are being paid for.
I think it's worth splitting these according to their purpose:
- best for storage (UTF8/WTF8)
- best for text processing / best for speed (UTF16/UTF32)
- best for interop with other langs / hosts (encoding agnostic)
And try to answer the question which of these categories should be implemented in WebAssembly in wasm GC and / or interface types?
having to deploy like 10 modules eventually that one cannot DCE anymore [..] people will think twice whether something like AS is worth it to use on the edge or in blockchain.
What would they use instead?
I think AssemblyScript will be at an advantage on the web compared to languages not designed for wasm. Off the web, it won't have any disadvantage, will it? C and Rust also need to ship a set/map/etc. impl (likely a larger one, for backwards compatibility).
Btw, if shipping a set/map/etc. impl for wasm-GC is an issue, I think wasm-GC can consider adding more to the spec. Those things could be generic, and not tied to JS semantics. Strings may be somewhat of a special case.
array
Your concerns with array may be valid, I'm not sure. Others would know better. But in general, I don't think you should expect wasm-GC arrays etc. to be identical to JS arrays. Wasm-GC things should be more static and optimizable, as I believe we don't intend for things like inline caches to be necessary to speed them up like in JS. Yes, that can lead to some subtle issues for languages, as you said, but it's both a risk and an opportunity I think.
they opted for Wasm GC. Not sure if interface types can help there
I think Interface Types should help all language interop in wasm, including ones compiled to wasm-GC? We'd like Java, C#, C, Rust, AssemblyScript, etc., all to be able to convert strings etc. back and forth. I'd expect AssemblyScript to benefit as much as all other languages. In fact if Interface Types only helped C and Rust, that would be unacceptable - it would violate the principles mentioned by @tlively and @aardappel
What would they use instead?
Can only speculate, but possibly a language that is reasonably general, like systems languages. Scenario might or might not be: Wasm GC doesn't have good interop -> Languages won't use it for interop / at all -> Engines won't provide interop -> GCed languages are at a disadvantage.
Btw, if shipping a set/map/etc. impl for wasm-GC is an issue, I think wasm-GC can consider adding more to the spec. Those things could be generic, and not tied to JS semantics. Strings may be somewhat of a special case.
I cannot quite follow there. Strings are by far the most important one, but now we are talking about generic sets and maps to be considered by the spec? How would such a map or set be any better than the current array, which is already not generic but a strange mixture of high-level data and low-level semantics that isn't quite good enough for anything? What we are looking at here are conflicting goals (...).
I don't think you should expect wasm-GC arrays etc. to be identical to JS arrays
Like here for example. It's not reasonable to expect these to be identical, but it is reasonable to expect good interop. But to have good interop, these must become more similar. Make it too simple, and it isn't good enough. Make it good enough, and it isn't exactly simple. Where do we draw the line? Limping interop?
Can only speculate, but possibly a language that is reasonably general, like systems languages. Scenario might or might not be: Wasm GC doesn't have good interop -> Languages won't use it for interop / at all -> Engines won't provide interop -> GCed languages are at a disadvantage.
Why would wasm GC not have good interop in edge/server (the context in this sub-thread)? It will have just as good interop as systems languages. Interface Types isn't a thing for linear memory languages, it's generic.
In more detail: AFAIK it'll be no harder for a toolchain using wasm-GC to define the Interface Type lifting/lowering operations - in fact, it's probably easier for GC since no malloc/free issues! And doing a copy in a lifting/lowering on GC data is no worse than doing a copy on linear memory.
The scenario I outlined imagines Wasm GC not having good interop in general, including on the web, making it less useful for languages, thus less desirable to support for engines, and ultimately leading to disadvantages for GCed languages on the edge/server, so people will rather pick a systems language that is reasonably general since it doesn't need GC (even doesn't strictly need interface types on most contemporary runtimes). As I noted this is all just speculation.
A more concrete concern is that having a universal stringref avoids unnecessary copies and garbage if the host (or another module for that matter) understands it. As a worst case scenario I imagine an AS module running on a CLR host, where if interface types are used there is an unnecessary copy at the boundary instead of passing the string, leading to unnecessary garbage on one side. How much of a problem that is depends on GC implementations in the wild I guess, and how hot a code path is, but I see potential being left on the table there too. In interface types terms, that'd be a gc-to-gc adapter instruction if I'm not mistaken. Does that even make sense? Reminds me of the double re-encoding problem that originally motivated the interface types issue.
As a worst case scenario I imagine an AS module running on a CLR host, where if interface types are used there is an unnecessary copy at the boundary instead of passing the string, leading to unnecessary garbage on one side.
Wasm on the CLR can do what (AFAIK) the CLR does for other languages today, which is to use the system string type, avoiding any copy (which would be strictly better than anything C or Rust can do). In wasm's case, that could be achieved using a pre-import (exactly parallel to what I expect on the web).
Of course such an implementation could also use a different string type. If that is different than the CLR string then it could do a gc-to-gc Interface Types adapter. That would have a copy, but should be at least as good as an adapter from a non-GC language (which also has to handle malloc/free).
Speculation is fine of course, but I don't see the basis for your pessimism? Everything points to wasm GC having great interop using pre-importing and Interface Types.
Forgive me if this has already been discounted. However, I would like to offer these observations:
- JS engines use a wide variety of internal representations for strings. They can do this because JS is a high-level language that does not offer direct access to the underlying string representation. IMO, this is a ground truth with many implications.
- Direct indexed access to chars will always be difficult. If you enforce a standard such as UTF8 or UTF16, you disadvantage many people (more than a billion at the last count).
- Tying JS internal representation to that of AS (for example) ties the hand of AS as well as JS.
On the other hand, a model where there is an abstract web string type, together with a set of string APIs, would allow decent interop without copying. What you would lose in that architecture is direct indexed access to the internal chars. What you would gain is seamless interop with complete agnosticism over issues like UTF8/16/32.
best for storage (UTF8/WTF8)
Yup, and the bulk of code touching strings qualifies as "storage", since it never needs to process these strings at a unicode level itself.
best for text processing / best for speed (UTF16/UTF32)
Nope.
There's a ton of parsers out there for programming languages, serializers, and other text formats that work directly on UTF-8 (because that's what the input data is), and do so without a speed or complexity penalty. Like @sunfishcode said, pretty much all text data in the world is UTF-8, and is processed by these kinds of parsers. If you're bringing any kind of parser into a Wasm module for speed reasons, it is likely this kind of parser too.
Only when you do natural language processing or font rendering (which are both highly specialized kinds of code!) could you consider another representation. But if you're doing that kind of code, I would certainly hope that you use specialized libraries like harfbuzz, unibreak, iconv and friends (because this stuff is seriously complicated to get right) rather than rolling your own using your PL's UTF-16 strings.
best for interop with other langs / hosts (encoding agnostic)
Yes, the best interop is interop specialized to the two endpoints: Interface Types.
Nope.
There's a ton of parsers out there for programming languages, serializers, and other text formats that work directly on UTF-8 (because that's what the input data is), and do so without a speed or complexity penalty. Like @sunfishcode said, pretty much all text data in the world is UTF-8, and is processed by these kinds of parsers. If you're bringing any kind of parser into a Wasm module for speed reasons, it is likely this kind of parser too.
I meant natural language processing or string searching / pattern matching, for example. ASCII and UTF-16 algorithms can exploit a fixed code point length and use lookup tables (a preprocessed alphabet, for example) to speed up processing, which is impossible with UTF-8. Also, decoding a codepoint from UTF-16 is much faster than decoding the same codepoint from UTF-8, and computing the actual byte size of a UTF-16 string can be done in O(1) time, while UTF-8 takes O(N).
At this point, perhaps it's a good idea to summarize. Much of the above is based on my observation that GC is our best chance to solve problems that we'll otherwise need additional, potentially complex or too-generic, mechanisms to address. Some of the challenges were brought up early on; others only became apparent over the course of the discussion when thinking more about interface types and pre-imports.
Fixed sized array
In some programming languages, like JavaScript, arrays are resizable by means of pushing to them (multiple times) for example. So what some compilers do is to resize arrays in an efficient way, for example by doubling the backing store if a push would otherwise just grow the array by one element at a time, and keep a separate length indicating the valid region.
In the MVP, this trick can be used if the element type is nullable or otherwise defaultable, but cannot be used if it is not. Furthermore, if the trick is used, the array implicitly becomes non-interoperable with JS because the other side would assume an invalid region of elements.
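A sketch of that doubling trick with a defaultable element type (all type and field names hypothetical; note the manual copy loop, since array.copy is only post-MVP):

```wat
(module
  (type $buf (array (mut anyref)))  ;; anyref is defaultable (null)
  (type $vec (struct
    (field $data (mut (ref null $buf)))
    (field $len  (mut i32))))

  ;; push with capacity doubling; assumes the initial capacity is >= 1
  (func $push (param $v (ref $vec)) (param $x anyref)
    (local $data (ref null $buf))
    (local $grown (ref null $buf))
    (local $len i32)
    (local $i i32)
    (local.set $data (struct.get $vec $data (local.get $v)))
    (local.set $len  (struct.get $vec $len  (local.get $v)))
    (if (i32.eq (local.get $len) (array.len (local.get $data)))
      (then
        ;; full: allocate a doubled backing store and copy element-wise
        (local.set $grown (array.new_default $buf
          (i32.mul (local.get $len) (i32.const 2))))
        (block $done
          (loop $copy
            (br_if $done (i32.ge_u (local.get $i) (local.get $len)))
            (array.set $buf (local.get $grown) (local.get $i)
              (array.get $buf (local.get $data) (local.get $i)))
            (local.set $i (i32.add (local.get $i) (i32.const 1)))
            (br $copy)))
        (local.set $data (local.get $grown))
        (struct.set $vec $data (local.get $v) (local.get $data))))
    ;; slots in [len, capacity) stay null: exactly the "invalid region"
    ;; JS would misinterpret if handed this array directly
    (array.set $buf (local.get $data) (local.get $len) (local.get $x))
    (struct.set $vec $len (local.get $v)
      (i32.add (local.get $len) (i32.const 1)))))
```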
I also briefly mentioned other data structures like sets and maps that might be worth thinking about, but I consider these rather post-MVPish.
Interface types adapters
If a string (or a more complex object) is not just pure data, but the data is let's say boxed into a struct with an additional i32 field that is the string's hashCode for example, interface types adapter instructions seem not to be powerful enough to make two modules interoperate that export or import strings (or more complex objects) like this:
```wat
(module
  (type $JVMString (struct (i32, (ref array i16))))
  (export (func (result (ref $JVMString))))
)

(module
  (type $CLRString (struct (i32, (ref array i16))))
  (import (func (param (ref $CLRString))))
)
```

Ecosystem benefits of a common string type
With pre-imports, the expectation expressed in the issue is that all modules running on the Web piggyback on a pre-imported JS string, or all modules running on the CLR piggyback on a pre-imported CLR string, to make sharing strings efficient. This requires modules to be compiled for a specific target, with compilers maintaining multiple standard library variants, either abstracting the differences away or expecting the user to modify their code for different standard library implementations. To me this seems unlikely to gain traction, because some languages might not want to provide multiple abstractions or standard libraries, because users expect their module to run everywhere, or because with many modules (think npm) the process becomes impractical. But if modules end up not utilizing the host's or an otherwise common string type, it seems that interface types adapters need to become more powerful, as outlined above, or a new category of performance bottlenecks is introduced, as outlined below.
Alloc+copy->garbage on the boundary
If we assume interface types become powerful enough, we may still get into a situation where converting at the boundary with, say, a gc-to-gc adapter instruction introduces unnecessary allocations, unnecessary copies, and unnecessary garbage objects for the GC to collect, becoming a problem in hot code paths that call external APIs many times with strings (or more complex objects).
The Web in WebAssembly
As a more general concern, I was worrying that not enough thought is being put into ensuring that the parts of the Web, i.e. JavaScript, Web APIs and WebAssembly, work well together. My experience has been that interoperability between JS/Web APIs and Wasm is a major pain point today, so not solving these challenges when we can, for instance when speccing GC, may leave Wasm impractical for things like code migration. In fact, letting this opportunity slip may one day be seen as Wasm's biggest blunder.
AssemblyScript and GC
We are obviously very interested in the GC proposal for various reasons:
- Reduce code size by not having to ship an MM and GC combination
- Improve interoperability with the web or other modules, which might or might not be a reasonable expectation
- Prototyping specs in AssemblyScript is rather straightforward anyway
As such I've offered to help the group prototype GC on various occasions, for instance by implementing the GC proposal in Binaryen for everyone to experiment with, but I've sadly grown less enthusiastic about the GC spec over the course of this issue. The requirements for AS haven't changed, however, yet they seem to be out of sync with the requirements proposed here, so I broken-heartedly decided to re-prioritize for the time being and revisit when I see fit.
Please take what's useful from my summary, thanks :)
Thanks for that writeup, @dcodeIO! Your position certainly seems very reasonable, given AssemblyScript's needs and priorities. I hope we can continue collaborating on making Binaryen (and WebAssembly!) better and more useful for AssemblyScript, if not in GC right now, then in other areas :) I also agree that good Web interop is critical for the long-term success of WebAssembly, and if the current tooling and spec efforts around reference types, interface types, staged compilation, and ESM integration aren't sufficient, I would want to revisit more "baked-in" solutions.
Interface types adapters
If a string (or a more complex object) is not just pure data, but the data is, say, boxed into a struct with an additional i32 field holding the string's hashCode, interface types adapter instructions do not seem powerful enough to make two modules interoperate that export or import strings (or more complex objects)
I actually have the opposite understanding: if attempting to interoperate by sharing memory (either wasm-GC-allocated memory or linear memory), both sides must have the exact same representation of the memory, due to the low-level nature of memory. In contrast, interface types is entirely oriented towards supporting quite distinct representations on either side of a module boundary. (In particular, the most recent PR speaks to supporting a wide variety of string representations. It only just went up, so understood if this is new information.) But of course, the tradeoff is a copy at the boundary in the general case, with three planned ways to achieve zero-copy: when both sides lift/lower from/to the same opaque host reference type (mentioned in the TODO); when both sides lift/lower from/to the same immutable, canonical GC memory type; and when one side is wasm using the canonical representation and the other side is a host that has optimized for the canonical representation. Totally understood that this tradeoff may not make sense for a language targeting tight interop with JS specifically, in which case I think the preimport route mentioned above makes sense.
Thanks, wasn't aware of the PR. I'm still concerned that pre-imports have problems (ecosystem fragmentation) that interface types (alloc+copy->garbage at the boundary) cannot solve and vice-versa, so even if we recommend A over B, or B over A, languages still face the drawbacks of the approach they have just committed to. For instance, pre-imports are nice for a Web-y language from a Web perspective, so they seem easy to recommend, but the language then somewhat has to commit to the Web, because off the Web it will face ecosystem fragmentation or need to implement separate standard libraries. And interface types seem like they will never quite cut it (sorry) due to alloc+copy->garbage overhead, except in the simplest of cases, i.e. unless the particular type of interop one is looking for is baked in, like strings, and a language's representation explicitly permits using what's baked in.
I don't think there is a solution that avoids copies in all places while using a single string type everywhere, which seems to be what you're asking for?
To avoid copies everywhere you need to use the native string type everywhere. That has been done in practice, for example in Python: CPython, Jython on the JVM, IronPython on the CLR, Pyjamas on the Web. Each uses a different string type, so the same Python may run differently in each of those.
Or you can use a single string type, like say C does. The same C code should always run the same. When you're on the web you need to bring your string-handling library with you (not optimal, but it's ok), and will need to copy. A web-friendly language could be in the opposite situation, where it needs to bring a string-handling library when off the web (also not optimal, but also ok).
Each of those approaches has different tradeoffs. But it's just not possible to get the advantages of both at once.
I would go further than @kripken: different languages have different internal & external semantics for strings. Asking for a common concept of string is almost certainly impossible.
Having said that, if one was willing to undertake an effort analogous to IEEE 754, then maybe the industry would eventually coalesce around that. Personally, I would not recommend holding one's breath.
That has been done in practice, for example in Python: CPython, Jython on the JVM, IronPython on the CLR, Pyjamas on the Web. Each uses a different string type, so the same Python may run differently in each of those.
Makes me wonder if this is a desirable outcome. It rather seems to me that not having to do the same work over and over again, potentially introducing slight incompatibilities here and there, is generally preferable. Just because it can be and has been done doesn't mean that it's what we should strive for. Somewhat similar to WTF-16 arising out of necessity, with a single UTF-8 being the preferable outcome (as per several comments in this issue).
I would go further than @kripken: different languages have different internal & external semantics for strings. Asking for a common concept of string is almost certainly impossible.
My impression so far was that if we'd support WTF-8 and WTF-16, we'd cover like 90% give or take of languages potentially running on Wasm one day, and that anything else will be hard to integrate with anyway, no matter the number of proposals we make. Curious how far off the estimate is, but if it isn't too far off, then I think it isn't exactly "impossible" and may be worth thinking about due to the sheer ecosystem benefits it has.
But it's just not possible to get the advantages of both at once.
Perhaps we can do better than those before us and get close. Would be so worth it if we can.
My impression so far was that if we'd support WTF-8 and WTF-16, we'd cover like 90% give or take of languages potentially running on Wasm one day
I agree we should try to do better than previous solutions. But even if this statement is true, it's two string types, with copying between them when languages disagree, which leaves most of the problem? Maybe I'm misunderstanding.
If you're suggesting adding these two string types to wasm, then I think the only benefit that would give is avoiding the need to ship a string library in a few more cases? That's a reasonable benefit, I'd agree, and I'm not opposed to that. But it's not a huge benefit, so I'd say we can wait to see how big a problem those few extra bytes are in practice (especially since the Web, where code size probably matters most, won't have this issue).
Just created Definitely not a Universal String Proposal for WebAssembly, outlining the approach I initially had in mind, but in more detail. See the implementation notes for how it attempts to avoid copying or other inefficiencies. This is not a proposal, but perhaps it helps to inform the discussion.
@dcodeIO is it a deliberate choice to present a limited number of operations on strings in this proposal, or are they meant as a starting point to flesh out later? My main concern is that without access to individual code units there's no efficient way to use such strings to implement common string APIs on top of them: reading a single code point at a random index, or concatenating two strings, requires lowering the entire string into linear memory and then doing it yourself.
I worry this limits the usefulness to a sort of long-lived version of Interface Types -- you can create an intermediate representation for boundary transfer that you can reuse, or pass through to another module without an additional copy.
But, that's actually a valid use case I think...
Yeah, so far it's mostly based on what I mentioned earlier in #145 (comment) enriched with details. Not super fleshed out, and not an actual proposal (yet?), but I figured it might be great to be able to discuss it, and bringing up code points is already doing that, so thanks! How would you design the respective instructions for code points? (in fact, maybe a PR there would be great for discussion)
I'll make a PR there after I've pondered on it. :)
@dcodeIO while thinking on that I found myself wondering how important the actual string type is, and am wondering now whether a useful way of thinking is to model strings as immutable arrays of u8 or u16, which can be bridged to host strings via the Interface Types conversions.
Arrays already provide the indexed element load instruction needed to iterate or do random access, and the most important operations to optimize are bulk-memory ops like concat & slice that are common with strings.
At the Interface Types layer, raising an (array immutable u16) to a string could copy it to a JS string, but it could also be fully optimized by backing the JS string with the array's existing buffer. Likewise in the inverse direction if the JS string uses a 16-bit backing buffer (they don't always).
A UTF-8 conversion would be applied on raising/lowering from/to (array immutable u8) for JS hosts, which could cache it for reuse of the same array/string.
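To illustrate what the copying fallback of such a bridge amounts to, here is some hypothetical glue in TypeScript; an optimizing engine would share the backing buffer instead of copying:

```ts
// "Raise" an immutable 16-bit array to a JS string (copying version).
function raiseToJSString(units: Uint16Array): string {
  let out = "";
  for (let i = 0; i < units.length; i += 4096) {
    // Chunked to stay within fromCharCode's argument-count limits.
    out += String.fromCharCode(...units.subarray(i, i + 4096));
  }
  return out;
}

// "Lower" a JS string back to an immutable 16-bit array (copying version).
function lowerFromJSString(s: string): Uint16Array {
  const units = new Uint16Array(s.length);
  for (let i = 0; i < s.length; i++) units[i] = s.charCodeAt(i);
  return units;
}
```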
I think this might give me what I would want for a JS subset if I weren't willing to directly use JS strings via function imports, and if engines optimize it, it's potentially very cheap when using 16-bit strings.
However there may be other benefits to a more directly shared type that's string-specific.
I agree that part of the challenges can, theoretically, be solved by interface types, but not all. For instance, we're still left with:
- Avoid ecosystem fragmentation as would be introduced by separate mechanisms to use on and off the Web (when taking pre-imports into account)
- Avoid alloc+copy->garbage at the boundary in between two Wasm GC-enabled languages and/or JavaScript (when copying remains necessary)
- Avoid code size hits by having to handle strings explicitly (with adapter functions) at the boundary or shipping basic string functions and their dependencies with each module
- Avoid hurting developer experience, like having to author, publish, ship and/or install a variety of adapters for/in different use cases
- Avoid redundant re-encodings when forwarding strings through multiple modules expecting varying encodings (think npm)
@dcodeIO Thanks for writing that up!
I wouldn't object to adding WTF-8 and WTF-16 types to wasm. But adding two types means you're not giving the ecosystem a clear signal, and likely you'd end up with two ecosystems of strings, which means copies between them.
Also, both of those types would not have optimal interop with strings on the CLR, JVM, or Python, for example (as your proposal says, they'd need to perform a check at the boundary - and also copy if there is an 8/16 mismatch), so I guess those are not in the 90% you expect the proposal to cover? But for many people those are very important platforms. I don't have numbers, but I'd strongly suspect the sum of those is far above 10% no matter how you look at it. (To really get something like 90%, I'd guess you need to also add UTF-8 and UTF-16 - but that just helps with some hosts, and still doesn't fully address the checks and copies between the 4 types.)
Overall I think that's a reasonable proposal, and I wouldn't have a problem with it. But the benefits seem moderate. I'd lean towards waiting on doing anything to see how the ecosystem evolves, which string types are used in practice, and how bad the overhead of copying etc. is - we can always consider adding these types later.
But adding two types means you're not giving the ecosystem a clear signal, and likely you'd end up with two ecosystems of strings, which means copies between them.
I don't think that it is WebAssembly's business to give a clear signal (-> to bias, especially not against JS). We had that discussion earlier, and people agreed iirc. I do not agree with your assessment that we'll end up with two ecosystems, because definitely-not-a-proposal is designed to be exactly inclusive across languages, engines and the Web using either encoding, making the bulk of languages, engines and the Web work together seamlessly.
Also, both of those types would not have optimal interop with strings on the CLR, JVM, or Python, for example (as your proposal says, they'd need to perform a check at the boundary
The check at the boundary is much cheaper than alloc+copy->garbage, since the common case is that strings are well-formed. Still a net-win. This is also noted in the document, as are the reasons for picking WTF. Indeed we should make a list of languages and the encodings these (would) use (in WebAssembly with Universal Strings). For instance, I'm not so certain that all of the CLR, JVM and Python actually/strictly require well-formedness, or if one does, that this cannot be handled more efficiently otherwise within their WebAssembly target's string implementation.
I'd guess you need to also add UTF-8 and UTF-16
I am happy to discuss a PR adding them. It needs to figure out trapping behavior: for instance, trapping behavior would immediately disqualify some languages, while picking WTF does not disqualify anyone. Previous discussions also indicated that a small set of initial encodings is preferred by some group members, but I (remember my words: I'm the good guy here) am happy to discuss and be convinced! Not gatekeeping my stuff at all.
I'd lean towards waiting on doing anything to see how the ecosystem evolves
Have been wondering if you'd make that argument actually. Interesting.
The check at the boundary is much cheaper than alloc+copy->garbage, since the common case is that strings are well-formed.
Very interesting - this may be a crucial point here! I disagree with that statement. Allocation and collection may be pretty cheap in a highly-tuned GC, O(1) (if allocation is a bump and it uses a generational GC where that item never makes it past the nursery). Whereas a check or a copy would be O(N) to traverse the string.
Maybe I'm making too much of an O(1) / O(N) difference here, as for a short string it'll all be in the cache. But imagine a loop that sends strings to the other side just to be compared to something. Optimally we'd want no O(N) operations on any of those - no copies or checks on the boundary, and the strings are interned so comparison is trivial. Languages compiling to JS have this today, I'm hoping wasm can too.
So in general, I'm worried about
- allocation using malloc/free (close to O(1), but has fragmentation issues), but not GC
- checking
- copying
It may be possible to avoid the full check, for example storing a "valid" bit. But that may increase the object size and/or add VM complexity. I guess that's what you refer to here:
For instance, I'm not so certain that all of the CLR, JVM and Python actually/strictly require well-formedness, or if one does, that this cannot be handled more efficiently otherwise within their WebAssembly target's string implementation.
I don't know enough about those details, but I'm curious!
Allocation and collection may be pretty cheap in a highly-tuned GC, O(1)
What I can tell is that it isn't that easy in, for instance, TLSF, which is what I have the most experience with, but that's also more of a predictable-performance MM than anything else. Other MMs typically optimize for small block sizes and often can't guarantee the same fast operation for larger ones. But even then, if a string is small, the overhead of checking is small anyway. Regarding a GC, I guess Immix is very close to just bump allocation, but even there, what you describe creates fragmentation, and more fragmentation means it has to obtain a new block from the global allocator more often and will eventually have to opportunistically defragment the block, again at a cost. As such, oversimplifying to a comparison with an ideal bump allocator doesn't seem like a well-thought-through argument to me. More allocations, more copies and more garbage is pretty much always bad as far as I can tell. I'll see if I can gather more intel on this, and if you or your coworkers can as well, that'd be great!
It may be possible to avoid the full check, for example storing a "valid" bit
Interesting idea! Might even be possible to set the bit upon construction or re-encoding of the string, making definitely-not-a-proposal even more awesome! (mind making a PR?)
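To make the cached-bit idea concrete, a minimal sketch (all names hypothetical): the O(N) well-formedness check runs at most once per string, and a string constructed from known-valid data could set the bit immediately.

```ts
// Caches the result of the O(N) well-formedness check, so repeated
// boundary crossings pay it at most once per string.
class BoundaryString {
  private valid: boolean | undefined; // undefined = not yet checked

  constructor(private readonly units: Uint16Array) {}

  isWellFormed(): boolean {
    if (this.valid === undefined) {
      this.valid = isWellFormedUTF16(this.units);
    }
    return this.valid;
  }
}

// True iff the 16-bit sequence contains no unpaired surrogates.
function isWellFormedUTF16(units: Uint16Array): boolean {
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    if (u >= 0xd800 && u <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      if (i + 1 >= units.length) return false;
      const next = units[i + 1];
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++;
    } else if (u >= 0xdc00 && u <= 0xdfff) {
      return false; // lone low surrogate
    }
  }
  return true;
}
```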
As such, oversimplifying to a comparison with an ideal bump allocator doesn't seem like a well-thought-through argument to me.
There are a few more technical details, but allocation in the nursery really does become essentially a bump, and there is no fragmentation issue:
The Nursery, on the other hand, just grows until it is full. You never need to delete anything, at least until you free up the whole Nursery during a minor GC, so there is no need to track free regions. Consequently, the Nursery is perfect for bump allocation: to allocate N bytes you just check whether there is space available, then increment the current end-of-heap pointer by N bytes and return the previous pointer.
https://hacks.mozilla.org/2014/09/generational-garbage-collection-in-firefox/
(of course those benefits only accrue to the nursery; heap allocation in general definitely has fragmentation issues, even in a GC language)
While searching I found this quote which I think summarizes it very well:
What's really important about generational GC is that heap allocation becomes nearly as cheap as stack allocation [i.e. O(1) - kripken]. That's a game changer.
https://news.ycombinator.com/item?id=17887235 (apologies for a HN comment, but you can trust that author!)
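For reference, the nursery fast path described in these quotes reduces to something like this sketch (illustrative only; a real nursery hands out actual memory and is emptied by the minor GC):

```ts
// Bump allocation: a bounds check plus a pointer increment, i.e. O(1).
class Nursery {
  private top = 0;
  constructor(private readonly capacity: number) {}

  alloc(bytes: number): number | null {
    if (this.top + bytes > this.capacity) return null; // full: run a minor GC
    const ptr = this.top;
    this.top += bytes;
    return ptr;
  }

  // After a minor GC evacuates survivors, the whole region is reusable.
  reset(): void {
    this.top = 0;
  }
}
```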
Thanks, these are interesting pieces of information and I'll look into these in more detail. As a first response I can offer
Most objects will be allocated into a separate memory region called the Nursery. When the Nursery fills up, only the Nursery will be scanned for live objects.
from the first piece, so there's still some GC pressure: filling up the nursery faster triggers scans of the nursery more often, which promote objects to the tenured region more often, and so on. Not saying that this isn't great, because it looks impressive to me in general, but it's also not O(1), especially taking into account that this is really only the ideal case. Curious how often that case triggers in an average program.
Gone through it now and made notes:
One (more generally) interesting aspect is that GGC assumes that copying is possible (it typically is if everything is JS with no external references), which might not be the case everywhere, especially off the Web. I wonder if this observation has impact in practice, though, because copying may be hidden behind the ref types if native code must unwrap the string first, i.e. likely copy it so no references can break. If it wouldn't unwrap and copy, but reference, then live objects in the nursery cannot be moved without breaking the reference (technically they can still be copied if the object is immutable like strings; not sure if there are other implications). Thought I'd mention it, because there might be other implications for GC algorithms off the Web that might be interesting to talk about in other discussions.
we refer to Nursery collections as minor GC
and it also incurs some overhead during normal operation
Then, this summarizes my first impression above quite well: one reduces the number of major GCs, but still pays the cost of minor GCs, including whatever a minor GC implies, like promoting objects, or simply more work overall because the concept is there. That cost might be small, though, but it's not free.
Then there are also situations like
Consider the question of how to figure out whether some Nursery object is live. It might be pointed to by a live Tenured object β for example, if you create an object and store it into a property of a live Tenured object.
So we only care about the Tenured objects that have been modified since the last minor (or major) GC [...] In technical terms, this is known as a write barrier.
With a store buffer, the time for a minor GC is dependent on the number of newly-created edges from the Tenured area to the Nursery, not just the number of live objects in the Nursery.
indicating not-immediately-obvious overhead in minor GCs beyond plain bump allocation, leading to
Also, keeping track of the store buffer records (or even just the checks to see whether a store buffer record needs to be created) does slow down normal heap access a little, so some code patterns may actually run slower with GGC.
indicating that additional work is necessary to enable the concept, and that there are other implied drawbacks.
You are still right that
On the flip side, GGC can speed up object allocation.
but then the conclusion of the (ideal-case-only) micro-benchmark still correctly states (and I value that!):
Note that this benchmark is intended to highlight the improvements possible with GGC. The actual benefit depends heavily on the details of a given script. In some scripts, the time to initialize an object is significant and may exceed the time required to allocate the memory. A higher percentage of Nursery objects may get tenured. When running inside the browser, we force enough major GCs (eg, after a redraw) that the benefits of GGC are less noticeable.
Hope this writeup, which is really just my impression from reading the pieces (and I only have good enough experience, I'm not an expert), is useful :)
On another note on WTF, I found this document on unicode.org, and under 2.7 Unicode Strings it says:
Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16, that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.
So it seems Java and C# don't strictly need well-formedness, just like JS. Perhaps it's even fair to conclude that this is the norm rather than the exception, so the WTF family of encodings makes perfect sense. The document also mentions other benefits I haven't brought up yet.
Why should there be differently encoded strings? There could be different factory instructions for a single abstract string type. The internal representation of the string should be the host's job. Java also uses different encodings for strings internally.
A string.len should return the length without knowing the encoding that was used at creation.
The idea behind having both 8-bit and 16-bit encodings (these are the two used in practice) is a use case like this: if we specified only an 8-bit encoding, and a module using a 16-bit encoding in its language passes a string to another module also using a 16-bit encoding, and the second module then wants to know the length of the string in its 16-bit encoding, we end up in a situation where the first module has to (implicitly) re-encode and the second module has to (implicitly) re-encode again. That's essentially the double re-encoding problem that motivated WebAssembly/interface-types#13 years ago, and it also applies if no encoding is specced but the engine picks one.
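A sketch of the cost being described, using TextEncoder/TextDecoder as stand-ins for whatever the boundary would do (module names are hypothetical):

```ts
// Two languages that both use 16-bit strings internally, forced through
// a hypothetical 8-bit-only boundary encoding.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

function moduleAExport(s: string): Uint8Array {
  return encoder.encode(s); // O(N) re-encode #1: 16-bit -> UTF-8
}

function moduleBImport(bytes: Uint8Array): string {
  return decoder.decode(bytes); // O(N) re-encode #2: UTF-8 -> 16-bit
}

// Every boundary call pays two full conversions that a shared
// 16-bit string type would avoid entirely.
const received = moduleBImport(moduleAExport("hello"));
```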
JavaScript also has different implementations of a string, and there is no re-encoding needed to return the length. You are thinking too much in terms of low-level programming languages. A string is a high-level object on the host. It can be a list of 8-bit and 16-bit chunks, and more. Its length is only an attribute.
What's important is that a string reference is immutable.
A string reference created from u8 data and one created from u16 data can be equal, so ref.eq must work between both. A concat must also be possible.
If I needed to know the encoding used by the creating module, then every module would have to re-encode strings to its own encoding, and there would effectively be more than one string reference type. Last but not least: if I pass such a string to JavaScript, then the call to length() must also work without knowing the creating encoding.
Noticed that you are working on JWebAssembly, which I think will have to solve very much the same challenges AssemblyScript has. If you want, I can offer to meet up with you on, say, Discord so we can discuss this in more detail? Curious to see how your adventure is going, and to hear your ideas. (btw, it seems we are both from .de)
@dcodeIO We can discuss this on https://discordapp.com/channels/453584038356058112/731539251698729030/765641907065192460 or any other location.
Agree with @dcodeIO and @Horcrux7 on a lot of points here. Wasm needs efficient immutable UTF-16-ish strings which are fully compatible with JS/DOM string in both directions. It doesn't make sense to copy strings on Wasm boundary, for languages, like AssemblyScript, Java and Kotlin, which are fine with using JS string encoding.
@jakobkummerow I'm wondering, is some form of this implementable in engines?
@skuzmich : Sure, we obviously have support for JavaScript strings, and exposing those to Wasm would be easy. We don't have UTF-8/WTF-8 support yet, and adding that likely wouldn't be pleasant in terms of implementation complexity, but it's certainly doable. I think it mostly boils down to a spec question: what kinds of strings and string operations do we want to have in Wasm?
I agree that for JS<->Wasm interop in particular, having compatible strings would likely be very handy. I have no particular opinion on what Wasm strings should look like. Following the discussion here with interest. Given the complexities around strings, I'm inclined to think that while they feel related to the GC proposal, they probably deserve to be split out into their own proposal, just to let us focus on one thing at a time.
As an example: JavaScript, by spec, requires UCS2/WTF-16, so JS engines support that. It turns out that lots of memory is occupied by strings in many web apps, and many of those strings are one-byte strings, so V8 actually uses a one-byte ("latin1") encoding internally when possible. That makes everything much more complex, and requires copying conversions every now and then, but the memory savings are considered worth it. I see this as one data point illustrating how tough the tradeoff space is, and what kinds of ugly compromises Wasm strings may have to make in order to satisfy a number of competing goals.
My understanding is that 1-byte strings in JS engines are a 1-byte subset of Unicode (e.g., ISO 8859-1) which maintains the core invariants:
- length is fixed and known
- reading a character code by index is possible in O(1)
- the character codes are the same
- the contents and length are immutable
Copying from a 1-byte to a 2-byte string would only be required if you were already copying it for some reason, such as a concatenation.
None of these conditions would be true of UTF-8/WTF-8 strings which seem to be proposed as the alternative to UCS-2/WTF-16.
@jakobkummerow thanks! People had concerns that the multiple internal representations of JS strings could prevent Wasm from using them efficiently. It is reassuring to hear that it is doable.
JS String operations that Kotlin's String class would care about are:
- Construction from an array of 16-bit units.
- Getting length.
- Getting i-th 16-bit Char.
- Checking content equality of two strings.
- String concatenation.
The last two are easily implemented using the others, but since they can be optimized for certain internal representations, it would be nice to have them in Wasm.
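As an illustration of that last sentence, here is what the naive versions built from the other operations look like (TypeScript sketch, names mine); an engine-level equality could instead compare interned references, and an engine-level concat could build a rope in O(1):

```ts
// O(N) content equality via per-unit access.
function contentEquals(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a.charCodeAt(i) !== b.charCodeAt(i)) return false;
  }
  return true;
}

// O(N) concatenation via construction from an array of 16-bit units.
function concat(a: string, b: string): string {
  const units = new Uint16Array(a.length + b.length);
  for (let i = 0; i < a.length; i++) units[i] = a.charCodeAt(i);
  for (let i = 0; i < b.length; i++) units[a.length + i] = b.charCodeAt(i);
  let out = "";
  for (let i = 0; i < units.length; i += 4096) {
    out += String.fromCharCode(...units.subarray(i, i + 4096));
  }
  return out;
}
```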
Btw this issue may relate to discussion about UTF16 vs UTF8 interop here:
WebAssembly/interface-types#38
I think the interface types question about UTF-16 vs WTF-16 is quite separate from the question about whether, within a component, we want to allow passing strings across certain language boundaries. Here's my mental model of the situation, but I may be misunderstanding the point of components, so please correct me if this doesn't match others' intentions.
At big component boundaries, surrogate-checked UTF-16 makes sense to me, maybe with an opt-in for WTF-16 omitting surrogate checks. Coming from JS, this regularity will cost a linear-time scan, in addition to probably flattening (of the internal rope data structure) and maybe copying the string (depending on what it ends up linking to). I'd be fine with including an unchecked WTF-16 domstring type initially, but I have no strong opinions and don't really see the rush.
Within a component, it will often be important to mix multiple programming languages, but it is even more important to avoid linear-time copies, checks, conversions between 8-bit and 16-bit string representations, and flattening of ropes. The importance is greater because more frequent back-and-forth is expected between the small modules of a component. This back-and-forth is amplified by the fact that there can be reference circularities within a component, whereas there is a hierarchical relationship between components.
One example of such tight back-and-forth interaction, which @dcodeIO provides, is between AssemblyScript and JavaScript. At Igalia, we're interested in providing good cross-language support between C/C++ and JavaScript. In both cases, copying, flattening, converting between 8-bit and 16-bit, or doing a linear-time check over the string to detect missing surrogate pairs would add significant cost, due to the intention to enable lots and lots of strings crossing the boundary.
For the within-component case, the ideal solution would allow not just malformed surrogate sequences and 8-bit strings as V8 has, but also would allow JS Strings implemented as ropes (e.g., as the result of concatenation) to be passed to Wasm, possibly manipulated, and passed back, without verifying paired surrogates or even flattening the string.
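To spell out why flattening matters there, a minimal rope sketch (types and names are mine): concatenation is O(1) node creation, while flattening, which the ideal above avoids paying at the boundary, copies every code unit.

```ts
type Rope =
  | { kind: "leaf"; units: Uint16Array }
  | { kind: "concat"; left: Rope; right: Rope; length: number };

function ropeLength(r: Rope): number {
  return r.kind === "leaf" ? r.units.length : r.length;
}

// O(1): concatenation allocates one node and copies nothing.
function ropeConcat(a: Rope, b: Rope): Rope {
  return {
    kind: "concat",
    left: a,
    right: b,
    length: ropeLength(a) + ropeLength(b),
  };
}

// O(N): flattening copies every code unit into one contiguous buffer.
function flatten(r: Rope, out?: Uint16Array, at = 0): Uint16Array {
  out ??= new Uint16Array(ropeLength(r));
  if (r.kind === "leaf") {
    out.set(r.units, at);
  } else {
    flatten(r.left, out, at);
    flatten(r.right, out, at + ropeLength(r.left));
  }
  return out;
}
```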
To add a new type: my own personal aesthetic would be to start simpler, with a single built-in type and instructions to manipulate it, rather than the more complex pre-import strategy, but either option could be OK. (I'm OK with a built-in type being a little opinionated; I'm not sure true neutrality is ever possible.)
I agree with others on this thread that a solution in this space is related to the Wasm GC proposal, but probably makes sense to pursue separately from this repository. In principle, such a separate proposal wouldn't actually depend on the GC proposal, since it could be refcounted, but I think we'd want to make it a subtype of externref (so, overall, it's very similar to typed function references). This new string type (or pre-import scheme) could provide a lot of value for both @dcodeIO 's and Igalia's use cases.
I suggest that we continue with the current MVP approach, to start out both Wasm GC and interface types without a solution for zero-copy, fully expressive string sharing across languages and components, while working in parallel on further proposals in this area such as the domstring interface type and a built-in Wasm string type (or pre-import scheme).
Now that we have https://github.com/WebAssembly/stringref, I'll go ahead and close this issue. Without that additional proposal, the options for strings include importing them as anyrefs or using byte arrays.