Library functionality: toUpperCase, etc
wingo opened this issue · 11 comments
From wingo/stringrefs#35: if we think about a browser deployment environment, we would like to avoid a situation in which two parts of a running system are implementing toUpperCase on a JavaScript string. If WebAssembly needs this or related string functionality, it should be able to use what's already in the browser.
So what's the story here? A couple options:
- Wild west: if you know you're targeting a browser, generate an associated JS wrapper that calls the String.prototype or Intl.prototype functions that you need.
- Standard library: regardless of whether you're targeting WASI or the web, you know there is a standard module you can import that provides the functionality you need. On the web, such a module delegates to String, Intl, and so on.
- Core: there is a string.to_upper instruction. Lots more, probably.
I have to say, I see (1) as an OK short-term answer and (2) as a nice end result. I wouldn't do (3) but who knows!
FYI, the J2Wasm team has reported a list of features for which they're planning to call imported JS functions for now:
'String.fromCharCode': String.fromCharCode,
'String.indexOf': (s, r, i) => s.indexOf(r, i),
'String.lastIndexOf': (s, r, i) => s.lastIndexOf(r, i),
'String.replace': (s, re, r) => s.replace(re, r),
'String.toLowerCase': (s) => s.toLowerCase(),
'String.toUpperCase': (s) => s.toUpperCase(),
'String.toLocaleLowerCase': (s) => s.toLocaleLowerCase(),
'String.toLocaleUpperCase': (s) => s.toLocaleUpperCase(),
I don't mind going with (1) for now. I think we may want a better answer eventually, but it's probably fine to postpone that until post-MVP.
We could also consider adding a string.new_scalar : [i32] -> [(ref string)] right away, to address the first case.
Regarding possible engine-side optimization, there's an interesting difference between String.fromCharCode and the rest of the list: since that first entry is a JavaScript built-in as-is, we can theoretically detect that and replace the function call with an inlined instruction sequence; we already support similar tricks for asm.js code using Math.* functions. All the other features need custom JS functions in order to move the first argument to the receiver position (they could use String.prototype.indexOf.call(s, r, i) instead of s.indexOf(r, i), but that would still be a custom JS function). These are much harder (possibly infeasible, or pointless) to recognize and inline, in particular because JavaScript is so dynamic: while it's exceedingly unlikely that a real-world program would override any of the String.prototype.* methods halfway through its execution, engines would always have to check for that possibility.
We could also consider adding a string.new_scalar : [i32] -> [(ref string)] right away, to address the first case.
With a single operand, the behavior there should probably be more like String.fromCodePoint, since String.fromCharCode would require two arguments to produce supplementary code points from a surrogate pair.
Regarding possible engine-side optimization, there's an interesting difference between String.fromCharCode and the rest of the list
Another aspect here might be that String.fromCharCode takes a variable number of arguments, so it would require multiple imports with different signatures. Since the number of input arguments would most likely be dynamic, however, it is just as likely that an implementation will fall back to some sort of reflection to call it, say String.fromCharCode.apply(String, arrayOfCharCodes). And if an implementation does that, it is likely that it will also do some sort of chunking, since String.fromCharCode easily overflows the stack otherwise.
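A minimal sketch of the chunking approach described above. The helper name stringFromCharCodes and the chunk size of 8192 are assumptions chosen for illustration, not values taken from any particular implementation:

```javascript
// Hypothetical helper: builds a string from an array (or typed array) of
// UTF-16 code units, calling String.fromCharCode on bounded slices so that
// no single call receives an excessive number of arguments.
function stringFromCharCodes(charCodes, chunkSize = 8192) {
  let result = '';
  for (let start = 0; start < charCodes.length; start += chunkSize) {
    const chunk = charCodes.slice(start, start + chunkSize);
    result += String.fromCharCode.apply(String, chunk);
  }
  return result;
}

const greeting = stringFromCharCodes([0x68, 0x69]);
const longRun = stringFromCharCodes(new Array(200000).fill(0x61));
```

The safe chunk size depends on the engine's argument limit, so a real implementation would likely tune it.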
they could use String.prototype.indexOf.call(s, r, i)
This is something that has been on my mind for quite some time as well. Basically, I think two aspects would improve Wasm<->JS interop via imports significantly: 1) a mechanism to call instance / prototype methods as in your example, and 2) the ability to new imported constructor functions, say to do things like new Date().getTimezoneOffset(). I guess if these cases could be indicated somehow, say with additional options on imports ("this is a constructor", "this is an instance method"), optimizing these would become more feasible? Varargs calls to imports for functions like fromCharCode could be a potential 3rd.
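For context, this is roughly the glue that is needed today for the new Date().getTimezoneOffset() case, since an import can only be called as a plain function. The import names below are hypothetical and only serve as a sketch:

```javascript
// Hypothetical import object: each entry is a custom JS function because
// neither `new` nor a method call with a receiver can be expressed directly
// on an imported function.
const dateImports = {
  // "This is a constructor": wraps `new Date()` and `new Date(ms)`.
  'Date.new': () => new Date(),
  'Date.fromMillis': (ms) => new Date(ms),
  // "This is an instance method": moves the first argument to the receiver.
  'Date.getTimezoneOffset': (d) => d.getTimezoneOffset(),
  'Date.getTime': (d) => d.getTime(),
};

const epoch = dateImports['Date.fromMillis'](0);
const offsetMinutes = dateImports['Date.getTimezoneOffset'](dateImports['Date.new']());
```

Import options of the kind suggested above would let the engine see Date and Date.prototype.getTimezoneOffset directly instead of these opaque wrappers.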
Isn't String.prototype.indexOf something that the engine could recognize as a prototype function, so that it can treat the first argument as the receiver? Or would we need to state that explicitly while importing the function?
The prototype lookup is indeed annoying. Could we instead import Function.call.bind(String.prototype.indexOf) et al? It would require a bit of work but you could see at compile-time that the import is a bound function, that the function itself is Function.call, and that the callee is e.g. String.prototype.indexOf. That way you capture indexOf and friends early, allowing the engine to inline and also preventing monkeypatching from altering the meaning of the indexOf operation.
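As a sketch of what this could look like for the J2Wasm import list quoted earlier in the thread: the helper name bindMethod is an assumption, and Function.prototype.call is used here because it is the same function object as Function.call.

```javascript
// Hypothetical helper: turns a prototype method into a plain function whose
// first argument is used as the receiver.
const bindMethod = (method) => Function.prototype.call.bind(method);

// The methods are captured once, when the import object is created.
const stringImports = {
  'String.fromCharCode': String.fromCharCode,
  'String.indexOf': bindMethod(String.prototype.indexOf),
  'String.lastIndexOf': bindMethod(String.prototype.lastIndexOf),
  'String.replace': bindMethod(String.prototype.replace),
  'String.toLowerCase': bindMethod(String.prototype.toLowerCase),
  'String.toUpperCase': bindMethod(String.prototype.toUpperCase),
};

const position = stringImports['String.indexOf']('hello world', 'o', 5);
const shouted = stringImports['String.toUpperCase']('abc');
```

Each entry is then a bound function rather than an arrow function, which is the shape an engine would need to inspect at compile time.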
I am also intrigued by the use of String.fromCharCode by the J2Wasm compiler. This would seem to suggest that ropes are a feature that is necessary for the MVP. Am I interpreting that right?
Sorry, I'm confused by how Function.call.bind helps here. How can String.prototype.indexOf change after importing if it is directly imported as String.prototype.indexOf?
Or are you referring to a scenario where the import is bound to a free JavaScript function that makes String API calls?
I am also intrigued about the use of String.fromCharCode by the J2Wasm compiler. This would seem to suggest that ropes are a feature that are necessary for the MVP. Am I interpreting that right?
String.fromCharCode is used only to satisfy the String.valueOf(char x) API. It is not used to generally construct larger strings (if that's what you are concerned about; I'm not sure if there are any other connections to ropes here?).
That being said, I was considering using string.concat for the builder implementation, which I assumed would require ropes, but it also sounded like they are already available per Jakub's comment, so I'm a little bit lost there: https://docs.google.com/document/d/1w2jLY7LuMG1grm_u7avtoAqYW1tcvPt4zc5_yNwTRyQ/edit?disco=AAAAelBHHaE
@gkdn If I understand you correctly, you aren't actually importing String.prototype.indexOf, you are importing (s, r, i) => s.indexOf(r, i). This will look up indexOf on String.prototype at run-time and then call it with s as the "receiver" and then the two additional arguments. A user can mutate String.prototype.indexOf after you capture that arrow function and your indexOf will then use their chosen indexOf.
Calling a JS function from WebAssembly passes null as the receiver (step 5). To explicitly pass the receiver, you need to use Function.call. The bind method specializes call to indexOf in a way that is not subject to user mutation.
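The difference can be observed in plain JavaScript, without any WebAssembly involved. In this sketch the names wrapped, bound, and withPatchedIndexOf are illustrative, and the patch returning -42 stands in for arbitrary user mutation:

```javascript
// Arrow-function import: looks up indexOf on String.prototype at call time.
const wrapped = (s, r, i) => s.indexOf(r, i);

// Bound import: captures the original indexOf when the import is created.
const bound = Function.prototype.call.bind(String.prototype.indexOf);

// Temporarily replaces String.prototype.indexOf while `fn` runs, then
// restores the original even if `fn` throws.
function withPatchedIndexOf(fn) {
  const original = String.prototype.indexOf;
  String.prototype.indexOf = function () { return -42; };
  try {
    return fn();
  } finally {
    String.prototype.indexOf = original;
  }
}

const results = withPatchedIndexOf(() => ({
  wrapped: wrapped('abc', 'c', 0), // sees the patched method
  bound: bound('abc', 'c', 0),     // still calls the captured original
}));
```

Here results.wrapped reflects the user's replacement, while results.bound still reports the real position of 'c'.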
Regarding String.fromCharCode, it is often used in JS to build up strings one char at a time. Thanks for the details regarding your usage of it.
I was considering to use string.concat for the builder implementation which I assumed would require ropes but it also sounded like they are already available
Correct. In the current implementation in V8, Wasm's string.concat gives you exactly the same ropes as JS's string1 + string2.
@wingo I wasn't sure whether you were responding to #5 (comment), which I sent as a reply to the previous comment ("..if these cases could be indicated somehow, say with additional options on imports..").
But yes, if it can help V8 to optimize, we can do the Function.call.bind trick.
@jakobkummerow String.fromCodePoint is showing up on a hot code path. Should we start looking into detecting and replacing such calls, or into introducing new APIs?
There is also a related discussion to be had about the fact that JS has access to a lot of Unicode data via RegExp; however, if languages want to access this data, they basically have to compile a bunch of individual regexps for what they're doing.
For example, if a language already has Unicode support, then compiling it to WASM would require either:
- Including the whole Unicode data (or at least a lot of it, depending on how generic the compiled libraries are)
- Transforming any Unicode lookups into a large number of JS regexps and exposing them as imports
Both have pretty big downsides.
For the first, the Unicode data is pretty large, and it might lead to divergence of versions between the compiled WASM and whatever the host has.
For the second, a lot of round-tripping between JS and WASM would be involved even for trivial operations like checking a Unicode property. Given that round-tripping seems to be less than optimal even for something as basic as String.fromCodePoint, one would expect round-tripping through regexes to be a lot more costly.
It would probably be good if there were Unicode-related instructions to perform the same functionality directly, such as:
// e.g. ID_Start, White_Space, etc
(unicode.matches_binary_property ($codePoint : i32) ($propertyName : stringref)) → i32
// e.g. Script=Greek, General_Category=Punctuation
(unicode.matches_property ($codePoint : i32) ($propertyName : stringref) ($propertyValue : stringref)) → i32
// e.g. RGI_Emoji_ZWJ_Sequence
(unicode.matches_sequence_property ($string : stringref) ($propertyName : stringref)) → i32
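For comparison, the first two proposed instructions could be approximated today with imported JS functions built on RegExp Unicode property escapes, which is the second option described above. This is only a sketch: the function names, the cache, and the i32-style 0/1 return values are assumptions that mirror the proposed signatures.

```javascript
// Cache of compiled regexps, keyed by the property expression, so each
// property is compiled only once.
const propertyRegexCache = new Map();

function regexForProperty(expression) {
  let regex = propertyRegexCache.get(expression);
  if (regex === undefined) {
    // Throws a SyntaxError if the engine does not know the property.
    regex = new RegExp(`^\\p{${expression}}$`, 'u');
    propertyRegexCache.set(expression, regex);
  }
  return regex;
}

// e.g. ID_Start, White_Space
function matchesBinaryProperty(codePoint, propertyName) {
  return regexForProperty(propertyName).test(String.fromCodePoint(codePoint)) ? 1 : 0;
}

// e.g. Script=Greek, General_Category=Punctuation
function matchesProperty(codePoint, propertyName, propertyValue) {
  const expression = `${propertyName}=${propertyValue}`;
  return regexForProperty(expression).test(String.fromCodePoint(codePoint)) ? 1 : 0;
}

const letterStartsIdentifier = matchesBinaryProperty(0x61, 'ID_Start');
const alphaIsGreek = matchesProperty(0x3B1, 'Script', 'Greek');
```

Every check still allocates a one-character string and crosses the JS boundary, which is the round-tripping cost the comment is concerned about.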