Languages where strings are primarily UTF-8

Question

eqrion opened this issue 2 years ago · 19 comments

I'm trying to figure out how languages that primarily use UTF-8 for their strings would use this proposal.

The first example that comes to mind is Rust, however a Rust String (which exists in linear memory) can be coerced to &str and so neither type can be transparently a stringref. So you'd need to either: (1) rewrite code to use a WasmString type or (2) copy on the boundary into linear memory. (2) isn't really different from what we have today from what I can tell.

Thinking about (1), I'm skeptical that existing code will be rewritten to use it, but assuming it is, I'm not sure how that code would use this proposal.

My best guess is they'd:

Represent WasmString as stringref so that you can use string.eq/concat
Get a stringview_wtf8 whenever an accessor is called (like indexing)

The concern I have with this is that SpiderMonkey wouldn't be able to store WTF8 contents inside our stringref for the medium-term future. So every single accessor call, like indexing, would force a transcode from the stringref to the view.

Maybe you could make WasmString cache the wtf8_view lazily so it can re-use the view from a previous accessor call? But then strings would have twice the memory overhead.
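A minimal sketch of the lazy-cache idea, to make the memory trade-off concrete. Everything here is hypothetical: `HostString` stands in for an engine that stores strings as 16-bit code units, `WasmString` is the wrapper type imagined above, and Python's UTF-8/UTF-16 codecs stand in for WTF-8/WTF-16, so only well-formed text is covered.

```python
class HostString:
    """Stand-in for an engine string stored as 16-bit code units (assumption)."""

    def __init__(self, text):
        self.units = text.encode("utf-16-le")


class WasmString:
    """Hypothetical wrapper a UTF-8 language might put around a stringref."""

    def __init__(self, text):
        self.ref = HostString(text)
        self._utf8_view = None  # filled on the first byte-indexed access
        self.transcodes = 0     # number of O(n) transcodes performed so far

    def _view(self):
        # Transcode once, then reuse the cached bytes on later accessor calls.
        if self._utf8_view is None:
            self._utf8_view = self.ref.units.decode("utf-16-le").encode("utf-8")
            self.transcodes += 1
        return self._utf8_view

    def byte_at(self, i):
        return self._view()[i]

    def footprint(self):
        # Bytes held by the host string plus the cached view, if any.
        cached = len(self._utf8_view) if self._utf8_view is not None else 0
        return len(self.ref.units) + cached


s = WasmString("hello, wasm")
before = s.footprint()  # only the host representation
first = s.byte_at(0)    # triggers the single transcode
second = s.byte_at(4)   # served from the cache
after = s.footprint()   # host representation plus the cached UTF-8 copy
```

The second accessor call costs no transcode, but the string now holds two copies of its contents for as long as the cache lives.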

Am I missing something? I also would be interested in other languages, but my mind is coming up blank.

Answer 1 · 2023-06-08T19:24:59.000Z

Yes, I think that Rust will continue to use its own linear memory for the foreseeable future.

The benefit of this proposal (with regards to Rust) is at the boundaries: you can compile Rust code to Wasm, and that Wasm module can export stringref which can be used by other Wasm modules.

Similarly, if the Rust code imports another Wasm module, it can easily convert the stringref back into a Rust String.

Similarly, it makes it a lot easier for Rust code to send strings to / from JS, because a JS string can be a stringref.

Performance-wise it should be similar to what we have today, but it improves interop and composability.

Answer 2 · 2023-06-08T19:34:29.000Z

I'm not sure that's any better than the current state of interop of Rust with other code on the Web today. Today you already can create a JS string from Rust string and use it on the boundary as an externref. Other modules can accept that and convert it to their own string value, or use it directly.

Outside of the Web, you need some larger interop story to handle more data types than strings, such as records/arrays/etc. If you're using the component-model, that allows you to pass strings between components without stringref.

Answer 3 · 2023-06-08T20:18:09.000Z

Today you already can create a JS string from Rust string and use it on the boundary as an externref. Other modules can accept that and convert it to their own string value, or use it directly.

That is already covered in the overview.

Performance-wise, that always forces 2 conversions to / from a JS string, with full O(n) transcoding. With stringref that conversion can be improved significantly.

For example, imagine two Wasm modules, both of which were compiled from Rust.

Because both modules were compiled from Rust, they both internally use UTF-8 String. When linking those two Wasm modules together, the Wasm compiler can notice that fact and then optimize it so that it just copies the raw bytes directly from one linear memory into the other linear memory, without any transcoding.

That means it just needs to do 1 very fast memcpy, instead of allocating a JS string and doing 2 O(n) transcodes. That is much faster than doing a Rust -> JS string -> Rust double conversion.

Consider this other example: imagine two Wasm modules. One of those Wasm modules is compiled from Rust, and the other Wasm module is compiled from a UTF-16 language (like C#).

With your approach, it would need to heap-allocate a JS string, do an O(n) transcode from UTF-8 linear memory into JS, then do a second O(n) transcode from the JS string into UTF-16 linear memory.

However, with stringref it could be optimized so that it just does a single O(n) transcode, copying bytes directly from one linear memory into the other linear memory. Which means it doesn't need to heap-allocate a JS string at all.
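The accounting in these two examples can be sketched as below. This only counts the work in each scenario as described in this comment; it does not show that an engine can actually skip the intermediate string. The function names are made up for illustration, and Python's codecs stand in for the real encoders (well-formed text only).

```python
def via_host_string(src_utf8, dst_encoding):
    """Cross the boundary through an intermediate host string.

    Returns (output bytes, O(n) passes over the data, intermediate allocations).
    """
    host = src_utf8.decode("utf-8").encode("utf-16-le")  # pass 1: into the host string
    out = host.decode("utf-16-le").encode(dst_encoding)  # pass 2: out of the host string
    return out, 2, 1


def direct(src_utf8, dst_encoding):
    """Cross the boundary without materializing a host string."""
    if dst_encoding == "utf-8":
        out = bytes(src_utf8)  # same encoding on both sides: plain byte copy
    else:
        out = src_utf8.decode("utf-8").encode(dst_encoding)  # one transcode
    return out, 1, 0


payload = "naïve café".encode("utf-8")
same_a = via_host_string(payload, "utf-8")     # UTF-8 module to UTF-8 module
same_b = direct(payload, "utf-8")
mixed_a = via_host_string(payload, "utf-16-le")  # UTF-8 module to UTF-16 module
mixed_b = direct(payload, "utf-16-le")
```

Both routes produce the same bytes; the difference is one pass and no intermediate allocation versus two passes and one allocation.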

Consider this third example: imagine a Wasm module is compiled from Rust. That Wasm module calls host APIs with strings (like browser DOM APIs).

With externref, it must heap-allocate a JS string, then do an O(n) transcoding from Rust into the JS string, and then call the host API. However, with stringref the host API can just read the UTF-8 bytes directly from linear memory, which means it doesn't need to heap-allocate a JS string.

Those sorts of optimizations can't be done with externref, but they can potentially be done with stringref.

Also, this proposal is not only for UTF-8 languages, it's designed to accommodate many languages. Languages which rely on specific memory layouts (like Rust or C++) won't benefit as much, and that's okay.

Outside of the Web, you need some larger interop story to handle more data types than strings, such as records/arrays/etc.

That is handled by other proposals, this proposal is only for strings.

Answer 4 · 2023-06-09T01:07:07.000Z

For example, imagine two Wasm modules, both of which were compiled from Rust.

Because both modules were compiled from Rust, they both internally use UTF-8 String. When linking those two Wasm modules together, the Wasm compiler can notice that fact and then optimize it so that it just copies the raw bytes directly from one linear memory into the other linear memory, without any transcoding.

That means it just needs to do 1 very fast memcpy, instead of allocating a JS string and doing 2 O(n) transcodes. That is much faster than doing a Rust -> JS string -> Rust double conversion.

I may be missing something, but if you’re compiling and linking two rust modules, why would you need JS involved at all?

But assuming that you do, stringref won’t help here. The first module would create a stringref from utf8 in its linear memory, and then the second would encode that into its own linear memory. In SpiderMonkey as noted above, both of these steps will involve a transcode between UTF8 and WTF16. That’s the same situation as with JS String and externref.

Theoretically, SM could gain WTF8 support for strings someday in the future. But in this case, we will still need to materialize the intermediate stringref and cannot just copy from one linear memory to the other directly.

Consider this other example: imagine two Wasm modules. One of those Wasm modules is compiled from Rust, and the other Wasm module is compiled from a UTF-16 language (like C#).

With your approach, it would need to heap-allocate a JS string, do an O(n) transcode from UTF-8 linear memory into JS, then do a second O(n) transcode from the JS string into UTF-16 linear memory.

However, with stringref it could be optimized so that it just does a single O(n) transcode, copying bytes directly from one linear memory into the other linear memory. Which means it doesn't need to heap-allocate a JS string at all.

Again, not sure how to avoid the copy required to materialize the intermediate stringref value that will be consumed by Java. Unless you’re assuming some sort of whole program analysis?

Also, this proposal is not only for UTF-8 languages, it's designed to accommodate many languages. Languages which rely on specific memory layouts (like Rust or C++) won't benefit as much, and that's okay.

Sure, I’m just trying to understand how UTF-8 focused languages are expected to benefit from this, as that seems like one of the design goals.

Answer 5 · 2023-06-09T04:19:29.000Z

I may be missing something, but if you’re compiling and linking two rust modules, why would you need JS involved at all?

There are plenty of reasons why you might want to dynamically link Wasm modules together, instead of statically linking Rust crates together.

Again, not sure how to avoid the copy required to materialize the intermediate stringref value that will be consumed by Java. Unless you’re assuming some sort of whole program analysis?

See the old interface types proposal:

https://github.com/WebAssembly/interface-types/blob/main/proposals/interface-types/Explainer.md#adapters

When linking Wasm modules together, the Wasm compiler has full knowledge of adapter functions, so it can inline and optimize them. That means the Wasm compiler can remove redundant copies and unnecessary transcoding. This does not require whole program analysis.

Although adapter functions are not currently a part of any proposals, they are an example of the kind of optimizations that stringref could potentially do in the future.

Answer 6 · 2023-06-09T11:11:33.000Z

I agree with @Pauan that linear-memory languages like Rust are highly unlikely to use stringref as their general-purpose string representation, because stringref provides garbage-collected strings. As discussed above, such languages may nevertheless have certain boundary-related use cases where stringref offers benefits.

So the examples to look for would be managed languages with UTF-8 strings. According to Wikipedia these may include e.g. Go, Julia, PyPy (and Swift, if someone decided to compile it to WasmGC instead of ARC'ed linear memory).

Generally speaking, any such language would use stringview_wtf8 just like other languages use stringview_wtf16: whenever they need to perform an operation on a string that wants to be able to assume that the string is encoded in wtf8/utf8.
If they don't have any such operations and can instead always remain in the realm of encoding-agnostic strings, even better.
Realistically, most string operations need to iterate over the string anyway. Since view creation is (very intentionally!) an explicit step, it's easy for toolchains to create the view before entering the string-iterating loop. That's one of the key reasons why the question "is view creation O(1) or O(n)?" is quite irrelevant in most practical scenarios: if it precedes an O(n) loop anyway, then the overall complexity of the function doesn't change either way.

And even if a given toolchain hasn't learned that trick yet, it's (admittedly a bit of legwork but) not overly difficult for an optimizing compiler in an engine to hoist view creation out of loops. V8 already supports that for as_wtf16 (which in V8 just like in SpiderMonkey flattens ropes), and it would be straightforward to add similar support to as_wtf8 when we have a reason to.
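A rough sketch of what hoisting buys, under the assumption that view creation is an O(n) transcode. `StringRef` and its `as_wtf8` method are a toy model loosely named after the instruction discussed here, not an engine API; the counter just records how often the view is created.

```python
class StringRef:
    """Toy model of a stringref backed by 16-bit code units (assumption)."""

    def __init__(self, text):
        self._units = text.encode("utf-16-le")
        self.view_creations = 0

    def as_wtf8(self):
        # Modeled as a full O(n) transcode on every call.
        self.view_creations += 1
        return self._units.decode("utf-16-le").encode("utf-8")


def count_byte_naive(s, target):
    """Re-acquires the view for every index: O(n) transcodes of O(n) each."""
    n = len(s.as_wtf8())
    hits = 0
    for i in range(n):
        if s.as_wtf8()[i] == target:
            hits += 1
    return hits


def count_byte_hoisted(s, target):
    """Acquires the view once, before the loop: one transcode plus an O(n) scan."""
    view = s.as_wtf8()
    hits = 0
    for b in view:
        if b == target:
            hits += 1
    return hits


naive = StringRef("a,b,c,d")
hoisted = StringRef("a,b,c,d")
naive_hits = count_byte_naive(naive, ord(","))
hoisted_hits = count_byte_hoisted(hoisted, ord(","))
```

Both functions return the same count. The naive one creates the view once for the length and once per byte, the hoisted one creates it a single time, which is the transformation a toolchain or engine would aim for.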

I think the fact that managed-memory UTF-8 based languages so far aren't strongly represented in the world of WasmGC matches (not by coincidence!) the fact that current engines don't have great UTF-8 optimizations. I expect what will happen is that over the years partnerships will emerge (I don't know which) where toolchains and engines work together to solve this chicken-and-egg problem and bring additional languages to Wasm. (As you're probably aware, some of these languages, such as Go, have additional requirements that the WasmGC MVP isn't providing, so there will be more need for collaboration anyway.)

So I'm not worried about the fact that current engines don't yet have highly-optimized support for UTF-8 strings/string_views, and I don't think they need to be in a rush to build it. I think it's a strength of the stringref proposal that it lays the spec-side groundwork for a future where there's more than WTF-16: when UTF-8 based languages become more popular as sources for Wasm modules, we won't have to change the spec; we'll only have to then address the "TODO: improve this" comments in our engines. I think a proposal that followed a strategy of "WTF-16 is the one thing current implementations can do well, so that should be the only concept we add to the spec" would be inferior.

Answer 7 · 2023-06-09T13:31:21.000Z

I may be missing something, but if you’re compiling and linking two rust modules, why would you need JS involved at all?

There are plenty of reasons why you might want to dynamically link Wasm modules together, instead of statically linking Rust crates together.

What tool chain are you using to dynamically link random wasm modules together? You need some ABI to decide how values from different languages are passed around. If it’s rust modules, you’ll use the rust tool chain and that will not use stringref as part of its internal ABI. If it’s C++, it’ll be a similar situation. There is no toolchain that supports dynamic linking of Rust and C#, except possibly the component-model. And there as noted before, stringref does not give you anything extra for linear memory languages.

Again, not sure how to avoid the copy required to materialize the intermediate stringref value that will be consumed by Java. Unless you’re assuming some sort of whole program analysis?

See the old interface types proposal:

https://github.com/WebAssembly/interface-types/blob/main/proposals/interface-types/Explainer.md#adapters

When linking Wasm modules together, the Wasm compiler has full knowledge of adapter functions, so it can inline and optimize them. That means the Wasm compiler can remove redundant copies and unnecessary transcoding. This does not require whole program analysis.

Although adapter functions are not currently a part of any proposals, they are an example of the kind of optimizations that stringref could potentially do in the future.

The key part of adapter functions which enables that optimization is that adapter functions could match a single lift with a single lower and fuse them to omit the temporary value that would be required. Stringref is not required for that.

Answer 8 · 2023-06-09T13:47:23.000Z

I agree with @Pauan that linear-memory languages like Rust are highly unlikely to use stringref as their general-purpose string representation, because stringref provides garbage-collected strings. As discussed above, such languages may nevertheless have certain boundary-related use cases where stringref offers benefits.

As discussed above, I don’t think there are boundary related benefits above the current state-of-the-art on the Web or off-the-Web.

So the examples to look for would be managed languages with UTF-8 strings. According to Wikipedia these may include e.g. Go, Julia, PyPy (and Swift, if someone decided to compile it to WasmGC instead of ARC'ed linear memory).

Generally speaking, any such language would use stringview_wtf8 just like other languages use stringview_wtf16: whenever they need to perform an operation on a string that wants to be able to assume that the string is encoded in wtf8/utf8. If they don't have any such operations and can instead always remain in the realm of encoding-agnostic strings, even better. Realistically, most string operations need to iterate over the string anyway. Since view creation is (very intentionally!) an explicit step, it's easy for toolchains to create the view before entering the string-iterating loop. That's one of the key reasons why the question "is view creation O(1) or O(n)?" is quite irrelevant in most practical scenarios: if it precedes an O(n) loop anyway, then the overall complexity of the function doesn't change either way.

And even if a given toolchain hasn't learned that trick yet, it's (admittedly a bit of legwork but) not overly difficult for an optimizing compiler in an engine to hoist view creation out of loops. V8 already supports that for as_wtf16 (which in V8 just like in SpiderMonkey flattens ropes), and it would be straightforward to add similar support to as_wtf8 when we have a reason to.

I think the fact that managed-memory UTF-8 based languages so far aren't strongly represented in the world of WasmGC matches (not by coincidence!) the fact that current engines don't have great UTF-8 optimizations. I expect what will happen is that over the years partnerships will emerge (I don't know which) where toolchains and engines work together to solve this chicken-and-egg problem and bring additional languages to Wasm. (As you're probably aware, some of these languages, such as Go, have additional requirements that the WasmGC MVP isn't providing, so there will be more need for collaboration anyway.)

So I'm not worried about the fact that current engines don't yet have highly-optimized support for UTF-8 strings/string_views, and I don't think they need to be in a rush to build it. I think it's a strength of the stringref proposal that it lays the spec-side groundwork for a future where there's more than WTF-16: when UTF-8 based languages become more popular as sources for Wasm modules, we won't have to change the spec; we'll only have to then address the TODO: improve this comments in our engines. I think a proposal that followed a strategy of "WTF-16 is the one thing current implementations can do well, so that should be the only concept we add to the spec" would be inferior.

Do you have any data on a managed language with UTF-8 strings using this proposal? My read from your comment is that it’s expected that for these languages there will be a copy+transcode every time they access their strings, with possibly some optimizations to common up this work in certain cases. I think the default expectation should be that this will have very poor performance, but I could be convinced otherwise if there was data to the contrary.

And I understand that in the future engines could add WTF-8 representations to speed this up and reduce memory usage. However, I would suggest removing these instructions from the proposal until if and when this optimization is generally available. It doesn’t make sense to have WTF-8 support in this proposal if languages will not use it until some future optimization. Codifying it in the spec now would make it harder to make changes if we deemed it necessary when adding WTF-8 support.

Answer 9 · 2023-06-09T17:41:57.000Z

What tool chain are you using to dynamically link random wasm modules together

There are multiple ways to link Wasm modules together, such as Wasmer, or linking the Wasm modules in the browser using the JS APIs. Eventually esm-integration will make linking Wasm modules much easier.

There is even an entire ecosystem of self-contained Wasm modules which are intended to be linked together.

And there is a push for using Wasm in serverless computing (e.g. AWS Lambda and Cloudflare) and also in cryptocurrency. Those also benefit from dynamically linking Wasm modules.

But that's getting very off-topic. Regardless of your opinion on it, some people do dynamically link Wasm modules together, and that is a use case that the WasmWG intends to support.

You need some ABI to decide how values from different languages are passed around.

WASI is an ABI that can be used for Wasm module communication. And the component-model proposal (previously interface-types) is also intended to create an ABI for cross-module communication. stringref is a (small) part of those proposals.

Of course people can create their own ABI as well (e.g. cryptocurrency creates their own ABI for Wasm modules).

The key part of adapter functions which enables that optimization is that adapter functions could match a single lift with a single lower and fuse them to omit the temporary value that would be required.

The goal is to have many different lift / lower instructions, including a "lift / lower from UTF8" instruction, which would copy UTF8 bytes from linear memory into a stringref.

So if the Wasm compiler sees a "lift from UTF8" instruction followed by a "lower from UTF8" instruction, then it can fuse them together and avoid the transcoding and intermediate stringref. This was discussed previously in the interface-types proposal.
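The fusion step described here can be sketched as a small peephole pass. The op names (`lift`, `lower`, `copy`, `transcode`) and the tuple encoding are invented for this illustration and are not taken from interface-types or stringref; the sketch says nothing about whether a stringref is needed to express the lift and lower.

```python
def fuse(ops):
    """Fuse each ("lift", enc) immediately followed by ("lower", enc).

    Same encoding on both sides becomes a plain copy; different encodings
    become a single transcode. Either way no intermediate string remains.
    """
    out = []
    i = 0
    while i < len(ops):
        is_pair = (
            i + 1 < len(ops)
            and ops[i][0] == "lift"
            and ops[i + 1][0] == "lower"
        )
        if is_pair:
            src, dst = ops[i][1], ops[i + 1][1]
            if src == dst:
                out.append(("copy", src))
            else:
                out.append(("transcode", src, dst))
            i += 2
        else:
            out.append(ops[i])  # anything else is left untouched
            i += 1
    return out


same = fuse([("lift", "utf8"), ("lower", "utf8")])
mixed = fuse([("lift", "utf8"), ("lower", "utf16")])
unpaired = fuse([("lift", "utf8"), ("call", "f"), ("lower", "utf8")])
```

The third case shows the limit of the pass: once something sits between the lift and the lower, the intermediate value has to exist and nothing is fused.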

Answer 10 · 2023-06-09T20:11:30.000Z

My point wasn't that there is no value in linking together code (or in dynamic linking). My point is that to link any code together they need to share the same ABI. You cannot link completely random wasm modules together.

For linear memory languages there are three options I know of:

Native ABI (Rust, C++, C) - everything is shared in a common linear memory, strings are passed as i32/i64 pointers
JS/Web ABI - inner native code with JS glue code wrapping it, strings are passed as JS Strings
Component Model - adapters copy from one linear memory to another linear memory, no stringref required

None of these benefit from stringref for linear memory languages.

WASI is an ABI that can be used for Wasm module communication. And the component-model proposal (previously interface-types) is also intended to create an ABI for cross-module communication. stringref is a (small) part of those proposals.

stringref is not in those proposals. The component-model (and interface-types before it) uses its own string type for communicating between components. The component-model could be extended to use it in the future, but as noted above, this doesn't gain anything for linear memory languages.

Answer 11 · 2023-06-09T22:06:42.000Z

@eqrion It seems there's some sort of misunderstanding here, so I'll try to clarify as best as I can...

Many different languages compile to Wasm. Most of those languages want GC strings, so they do not want to put strings into linear memory. That includes languages like Java, C#, Python, Go, etc.

Some languages however don't want GC strings, they do want to put strings into linear memory (Rust, C++, etc.)

In addition to that, languages have different string encodings and representations. C++ has NUL terminated strings, Rust does not. Some languages use UTF-8, some use UTF-16, some use WTF-16, etc.

In addition to that, Wasm modules need to interop with the host. That host could be the browser / JS (which uses WTF-16 GC strings), or it could be something else entirely (Wasmtime, Wasmer, etc.)

One of the goals of Wasm is to allow different Wasm modules to interop with each other, regardless of their source language, and regardless of their internal representation.

That means we need a string ABI which can accommodate as many languages as we can, and can also accommodate many different hosts, and can also accommodate both GC and non-GC strings, while still being efficient.

The current component-model string is an MVP. That means it is intentionally not designed to solve the problem of universal string interop. It's just the simplest thing that works right now.

In particular, the component-model string is always USVString, which does not work for interop with the host, and it also doesn't support GC strings either.

Many Wasm proposals are like that: they do the minimum necessary to get the proposal working, but they leave room for future improvement.

However, in the long term the component-model string is not good enough. We need a string type which can work both for GC languages and non-GC languages, and it must also be able to fully interop with the host as well. That's where stringref comes into play.

stringref is designed to work with both GC languages (like Java, C#, etc.) and also non-GC languages (like Rust). GC languages can just use stringref directly, they no longer need to store strings in linear memory.

However non-GC languages still benefit, because they are now able to seamlessly interop with any other Wasm module, including modules that use GC.

Let's say that I create a Rust Wasm module. I then publish that Rust Wasm module as a library, so other people can use it.

My Rust Wasm module might be linked to any other Wasm module. Since it's a library, I do not know ahead of time which modules it will be linked to.

If my Rust Wasm module uses stringref then it will work seamlessly with any other Wasm module, regardless of their internal string representation, and regardless of whether they use GC or not. My Rust Wasm module can also work seamlessly with any host as well.

In the ideal case where the linked modules have the same string representation (e.g. UTF-8) then the adapter functions will be optimized to remove the redundant transcoding.

In the less ideal case where the linked modules don't have the same string representation, then there is a performance cost, but at least it still works, because both modules are using stringref.

So we get universal string interop regardless of the source language, and regardless of the host, and the performance is optimized. This is something that the component-model string cannot do, and externref cannot do it either, but stringref can.

Answer 12 · 2023-06-16T18:25:10.000Z

One of the goals of Wasm is to allow different Wasm modules to interop with each other, regardless of their source language, and regardless of their internal representation.

That means we need a string ABI which can accommodate as many languages as we can, and can also accommodate many different hosts, and can also accommodate both GC and non-GC strings, while still being efficient.

Where are you seeing this as a goal of WebAssembly? I don't see it on the listed high-level goals. This looks like more of a goal of the component-model, which is a separate layer from wasm.

stringref is designed to work with both GC languages (like Java, C#, etc.) and also non-GC languages (like Rust). GC languages can just use stringref directly, they no longer need to store strings in linear memory.

Languages using Wasm-GC can already store them in arrays of i8 or i16. From the above discussion, it sounds like linear memory languages won't use stringref internally, and I'm also unsure if they would use them as part of their ABI instead of externref.

However non-GC languages still benefit, because they are now able to seamlessly interop with any other Wasm module, including modules that use GC.

Let's say that I create a Rust Wasm module. I then publish that Rust Wasm module as a library, so other people can use it.

My Rust Wasm module might be linked to any other Wasm module. Since it's a library, I do not know ahead of time which modules it will be linked to.

If my Rust Wasm module uses stringref then it will work seamlessly with any other Wasm module, regardless of their internal string representation, and regardless of whether they use GC or not. My Rust Wasm module can also work seamlessly with any host as well.

My point above about ABI is that you cannot publish a Rust wasm module to be linked with any arbitrary other wasm module from possibly a different language. Strings are a small part of the Rust ABI, you would also need to define structs, arrays, tuples, enums, references, pointers, etc. You need every detail to line up for linking to work.

The only proposal I know of that is tackling this issue is the component-model and that should not be confused with the core wasm instruction set.

Answer 13 · 2023-06-16T18:33:32.000Z

I want to refocus this issue on UTF-8 users of this proposal, so going back to my earlier point:

Do we have any data on a managed language with UTF-8 strings using this proposal? My concern is that for languages running in engines without native UTF-8 string representations (all of them in the near to medium term), there will be very frequent copy+transcode operations as they'll need to reacquire the non-native utf8-view for every operation and access.

I understand we could sometimes common up acquiring the views using local function optimizations, but I don't think that will be enough to have good performance.

Answer 14 · 2023-06-19T12:03:19.000Z

I'm not aware of any concrete data so far; the Scheme-to-Wasm compiler that @wingo is working on is probably closest to being able to generate such data, but I don't know what "closest" means in actual calendar terms.

they'll need to reacquire the non-native utf8-view for every operation and access

It's not every operation. Specifically, the following operations don't need to acquire a utf8-view (regardless of whether each of these is an instruction or an import):

comparisons
concatenations
toUpper/toLower case conversions
to/from i32/f64 conversions
iteration over codepoints
slicing by iterator

Whereas this is the list of operations that do need to acquire a utf8-view:

slicing by byte offset

This is why I'm not worried: performance should be at least "okay" right out of the box, even on engines without UTF-8 optimizations.

Answer 15 · 2023-06-21T17:33:00.000Z

I'm not aware of any concrete data so far; the Scheme-to-Wasm compiler that @wingo is working on is probably closest to being able to generate such data, but I don't know what "closest" means in actual calendar terms.

they'll need to reacquire the non-native utf8-view for every operation and access

It's not every operation. Specifically, the following operations don't need to acquire a utf8-view (regardless of whether each of these is an instruction or an import):

* comparisons
* concatenations
* toUpper/toLower case conversions
* to/from i32/f64 conversions
* iteration over codepoints
* slicing by iterator

Whereas this is the list of operations that do need to acquire a utf8-view:

* slicing by byte offset

This is why I'm not worried: performance should be at least "okay" right out of the box, even on engines without UTF-8 optimizations.

This relies on having these operations be implemented in the host (as instruction or import) where it can use the native encoding format directly. For UTF-8 specific operations that aren't standardized across hosts, these will need to be emulated in wasm code. In some cases they may be able to use the iterator interface, but not if their source code is written to use indices into the raw bytes. Using indices to access strings is pretty common in standard libraries [1], and I would guess it also happens in user code too.

It's theoretically possible this could have acceptable performance, but I think there are very good reasons to default to thinking that representing a source language's string in a non-native encoding won't work. It just takes a loop, a large string, and an indexing operation on it to get a ton of copies+transcodes. And if I was a source language compiling to wasm, I would want control over this to prevent this from happening and would avoid using stringref if this was a possibility.

[1] https://cs.opensource.google/go/go/+/refs/tags/go1.20.5:src/strings/strings.go;l=1049

Answer 16 · 2023-06-21T19:39:52.000Z

This relies on having these operations be implemented in the host

I strongly believe that you generally want all performance-critical "bulk" operations (processing an entire string at once) to be implemented in the host, because that makes them much faster. This is very similar to the memory bulk operations.

It just takes a loop, a large string, and indexing operation on it to get a ton of copies+transcodes.

Of course you wouldn't want "a ton of copies+transcodes" for a single loop. That's why this proposal offers view creation as an explicit step, so you do have control to do that just once, before entering the loop. (And in the fullness of time, engines will turn even that into a no-op.)

The proposal also offers the ultimate escape hatch of converting strings to arrays (and back), for arbitrary manipulation.

Using indices to access strings is pretty common in standard libraries [1]

The specific Go loop you linked to would be well expressible in a direct translation, in particular all the indexing it does:

* for Go's ==, there's string.eq
* for Go's Builder, there's a choice of:
  - repeated string.concat
  - using an array, and stringview_wtf8.encode_utf8_array to write into it, and finally string.new_utf8_array
* for Go's len, there's string.measure_utf8
* for Go's utf8.DecodeRuneInString, there's stringview_wtf8.advance
* for Go's s[start:j], there's stringview_wtf8.slice

Go's Count() and Index() would have to be added to the proposal, or imported, or implemented in userspace on top of the proposal (for which the proposal offers the required building blocks).
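As a rough illustration of how a replace-style loop could be laid out on top of these building blocks, here is a Python model. The helpers `as_wtf8` and `view_slice` are stand-ins loosely named after the instructions listed above, not their real signatures, and `bytes.find` plays the role of a userspace or imported Index(). The empty-pattern case (where Go would decode one rune at a time) is left out, so this is a simplified sketch rather than a translation of the linked code.

```python
def as_wtf8(s):
    """Stand-in for acquiring a UTF-8 view of a string (assumption)."""
    return s.encode("utf-8")


def view_slice(view, start, end):
    """Stand-in for slicing a view by byte offsets back into a string."""
    return view[start:end].decode("utf-8")


def replace_all(s, old, new):
    # Encoding-agnostic equality check; no view is needed for this step.
    # Unlike Go, an empty pattern simply returns the input unchanged here.
    if old == new or old == "":
        return s
    view = as_wtf8(s)         # acquired once, before the loop
    old_bytes = as_wtf8(old)
    parts = []
    start = 0
    while True:
        j = view.find(old_bytes, start)  # Index() on byte offsets
        if j < 0:
            break
        parts.append(view_slice(view, start, j))  # s[start:j]
        parts.append(new)
        start = j + len(old_bytes)
    parts.append(view_slice(view, start, len(view)))
    return "".join(parts)     # models repeated concatenation


replaced = replace_all("héllo wörld, héllo", "héllo", "bye")
untouched = replace_all("abc", "x", "y")
```

All indexing happens on byte offsets into a single view, and because the pattern is itself valid UTF-8, every match starts and ends on a code point boundary, so the slices are always well formed.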

(FWIW, I do think that stringview_wtf8.get_codepoint and/or stringview_wtf8.get_raw_byte instructions might well be worthwhile additions. It would be great to get implementation feedback from a UTF-8 based language. Lacking that, we could leave out the stringview_wtf8 part for now; or we could decide that it's so small that what's already there is exceedingly likely to be useful, as e.g. the Go example above shows, and whatever primitives turn out to be missing can easily be added later.)

Answer 17 · 2023-06-21T20:42:38.000Z

This relies on having these operations be implemented in the host

I strongly believe that you generally want all performance-critical "bulk" operations (processing an entire string at once) to be implemented in the host, because that makes them much faster. This is very similar to the memory bulk operations.

Memory bulk operations have a nice complexity/benefit payoff in my opinion. They're simple loops of loads/stores and are extremely common and hot in programs. String operations (assuming things like toUpper/trim/split) are much harder to specify as there are an order of magnitude more of them across different languages (with incompatible variants of the same concept). And it's unclear to me how much better a host string 'trim' method could be over a wasm string 'trim' method to justify the complexity.

It just takes a loop, a large string, and indexing operation on it to get a ton of copies+transcodes.

Of course you wouldn't want "a ton of copies+transcodes" for a single loop. That's why this proposal offers view creation as an explicit step, so you do have control to do that just once, before entering the loop. (And in the fullness of time, engines will turn even that into a no-op.)

I think the problem I'm getting at is that the Go string type would need to be a stringref type to support equals/concat/etc and every string index expression would then need to get the stringview_wtf8 of the stringref and then index into that.

If your source language compiler can reliably hoist the stringview_wtf8 acquisition outside of all indexing loops (including if the indexing happens in a separate function that needs to be inlined), then great. But if you mess it up, you're in the situation I described.

Answer 18 · 2023-06-21T20:51:55.000Z

Go may not be a good example here, their string type is just bytes with the encoding given to it by each operation that accesses it. So it's not clear to me that they would use stringref.

Answer 19 · 2023-06-22T10:13:24.000Z

But if you mess it up, you're in the situation I described.

If the X-to-Wasm compiler messes it up and produces a suboptimal module, then there's still a good chance that a sufficiently smart engine will save the day, by doing the hoisting engine-side.

This isn't specific to UTF-8, or stringref, or even Wasm! Even in the existing case of JS strings in a JS engine, you don't want to check "is this string a rope that needs to be flattened?" in every iteration of a string-indexing JS loop. You want to perform that check only once, before the loop. JS doesn't have a way to express that in the language, so engines have no other option but to do it automatically under the hood. If an engine can do that for JS strings, then the same technique can make string view creation acceptably efficient in cases where the engine doesn't have a native string representation matching the requested view.

So compared to the status quo in JS, the stringref proposal's concept of views improves the situation (in typical Wasm fashion) by (1) giving module producers more control and (2) making engines' lives easier.