WebAssembly/stringref

Library functionality: toUpperCase, etc

wingo opened this issue · 11 comments

wingo commented

From wingo/stringrefs#35, if we think about a browser deployment environment, we would like to avoid a situation in which two parts of a running system are implementing toUpperCase on a JavaScript string. If WebAssembly needs this or related string functionality, it should be able to use what's already in the browser.

So what's the story here? A couple options:

  1. Wild west: if you know you're targetting a browser, generate an associated JS wrapper that calls the String.prototype or Intl.prototype functions that you need
  2. Standard library: Regardless if you're targetting WASI or the web, you know there is a standard module you can import that provides the functionality you need. On the web, such a module delegates to String, Intl, and so on.
  3. Core: There is a string.to_upper instruction. Lots more, probably.

I have to say, I see (1) as an OK short-term answer and (2) as a nice end result. I wouldn't do (3) but who knows!

FYI, the J2Wasm team has reported a list of features for which they're planning to call imported JS functions for now:

    'String.fromCharCode': String.fromCharCode,
    'String.indexOf': (s, r, i) => s.indexOf(r, i),
    'String.lastIndexOf': (s, r, i) => s.lastIndexOf(r, i),
    'String.replace': (s, re, r) => s.replace(re, r),
    'String.toLowerCase': (s) => s.toLowerCase(),
    'String.toUpperCase': (s) => s.toUpperCase(),
    'String.toLocaleLowerCase': (s) => s.toLocaleLowerCase(),
    'String.toLocaleUpperCase': (s) => s.toLocaleUpperCase(),

I don't mind going with (1) for now. I think we may want a better answer eventually, but it's probably fine to postpone that until post-MVP.

We could also consider adding a string.new_scalar : [i32] -> [(ref string)] right away, to address the first case.

Regarding possible engine-side optimization, there's an interesting difference between String.fromCharCode and the rest of the list: since that first entry is a JavaScript built-in as-is, we can theoretically detect that and replace the function call with an inlined instruction sequence; we already support similar tricks for asm.js code using Math.* functions. All the other features need custom JS functions in order to move the first argument to the receiver position (they could use String.prototype.indexOf.call(s, r, i) instead of s.indexOf(r, i) but that would still be a custom JS function). These are much harder (possibly infeasible, or pointless) to recognize and inline, in particular because JavaScript is so dynamic: while it's exceedingly unlikely that a real-world program would override any of the String.prototype.* methods halfway through its execution, engines would always have to check for that possibility.

We could also consider adding a string.new_scalar : [i32] -> [(ref string)] right away, to address the first case.

With a single operand, the behavior there should probably be more like String.fromCodePoint, since String.fromCharCode would require two arguments to produce supplementary code points from a surrogate pair.

Regarding possible engine-side optimization, there's an interesting difference between String.fromCharCode and the rest of the list

Another aspect here might be that String.fromCharCode takes variable length arguments, so would require multiple imports with different signatures. Since the input number of arguments would most likely be dynamic, however, it is just as likely that an implementation will fall back to some sort of reflection to call it, say String.fromCharCode.apply(String, arrayOfCharCodes). And if an implementation does that, it is likely that it also will do some sort of chunking, since String.fromCharCode easily overflows the stack otherwise.

they could use String.prototype.indexOf.call(s, r, i)

This is something that has been on my mind for quite some time as well. Basically two aspects I think that would improve Wasm<->JS interop via imports significantly: 1) A mechanism to call instance / prototype methods as in your example, and 2) the ability to new imported constructor functions, say to do things like new Date().getTimezoneOffset(). I guess if these cases could be indicated somehow, say with additional options on imports ("this is a constructor", "this is an instance method"), optimizing these would become more feasible? Varargs calls to imports for functions like fromCharCode could be a potential 3rd.

gkdn commented

Isn't String.prototype.indexOf something that the engine could recognize as prototype function and so it can consider the first argument as receiver or should we need to explicitly tell that while importing the function?

wingo commented

The prototype lookup is indeed annoying. Could we instead import Function.call.bind(String.prototype.indexOf) et al? It would require a bit of work but you could see at compile-time that the import is a bound function, that the function itself is Function.call, and that the callee is e.g. String.prototype.indexOf. That way you capture indexOf and friends early, allowing the engine to inline and also preventing monkeypatching from altering the meaning of the indexOf operation.

wingo commented

I am also intrigued about the use of String.fromCharCode by the J2Wasm compiler. This would seem to suggest that ropes are a feature that are necessary for the MVP. Am I interpreting that right?

gkdn commented

Sorry I'm confused by how Function.call.bind helps here. How canString.prototype.indexOf change after importing if it is directly imported as String.prototype.indexOf?

Or are you referring to scenario where the import is bound to a free JavaScript function that makes String API calls?

I am also intrigued about the use of String.fromCharCode by the J2Wasm compiler. This would seem to suggest that ropes are a feature that are necessary for the MVP. Am I interpreting that right?

String.fromCharCode is used only to satisfy String.valueOf(char x) API. It is not used to generally construct larger strings (if that's what you are concerned about - I'm not sure if there any other connections to ropes here?).

Being said that, I was considering to use string.concat for the builder implementation which I assumed would require ropes but it also sounded like they are already available per Jakub's comment so I'm little bit lost there: https://docs.google.com/document/d/1w2jLY7LuMG1grm_u7avtoAqYW1tcvPt4zc5_yNwTRyQ/edit?disco=AAAAelBHHaE

wingo commented

@gkdn If I understand you correctly, you aren't actually importing String.prototype.indexOf, you are importing (s, r, i) => s.indexOf(r, i). This will look up indexOf on String.prototype at run-time and then call it with s as the "receiver" and then the two additional arguments. A user can mutate String.prototype.indexOf after you capture that arrow function and your indexOf will then use their chosen indexOf.

Calling a JS function from WebAssembly passes null as the receiver (step 5).. To explicitly pass the receiver, you need to use Function.call. The bind method specializes call to indexOf in a way that is not subject to user mutation.

Regarding String.fromCharCode, it is often used in JS to build up strings one char at a time. Thanks for the details regarding your usage of it.

I was considering to use string.concat for the builder implementation which I assumed would require ropes but it also sounded like they are already available

Correct. In the current implementation in V8, Wasm's string.concat gives you exactly the same ropes as JS's string1 + string2.

gkdn commented

@wingo I wasn't sure if you were responding to #5 (comment) which I sent with the intention to reply around previous comment (..if these cases could be indicated somehow, say with additional options on imports..).

But yes if it can help V8 to optimize, we can do the Function.call.bind trick.

gkdn commented

@jakobkummerow String.fromCodePoint is showing up at hot code path. Should we start look into detecting and replacing them or introducing new APIs?

There is also a related discussion to be had about the fact JS has access to a lot of Unicode data via RegExp however if languages want to access this data they basically have to compile a bunch of individual regexpes from what they're doing.

Like if a language already has Unicode support then compiling that to WASM would require either:

  1. Including the whole unicode data (or at least a lot depending on how generic the compiled libraries are)
  2. Transforming any unicode to a large number of JS regexpes and exposing them as imports

both have pretty big downsides.

For the first the unicode data is pretty large AND it might lead to divergence of versions between the compiled WASM and whatever the host has.

For the second a a lot of round-tripping between JS and WASM would be involved even for trivial operations like checking a unicode property. Given that round-tripping seems to be less than optimal even for something as basic as String.fromCodePoint one would expect round-tripping through regexes to be a lot more costly.

It would probably be good if there were unicode related instructions to perform the same functionality directly such as:

// e.g. ID_Start, White_Space, etc
(unicode.matches_binary_property ($codePoint : i32) ($propertyName : stringref)) → i32
// e.g. Script=Greek, General_Category=Punctuation
(unicode.matches_property ($codePoint : i32) ($propertyName : stringref) ($propertyValue : stringref)) → i32
// e.g. RGI_Emoji_ZWJ_Sequence
(unicode.matches_sequence_property ($string : stringref) ($propertyName : stringref)) → i32