Injecting platform specific instructions or blocks as inlined function imports at compile-time (capabilities)

Question

ttraenkler opened this issue 2 years ago · 26 comments

The discussion during the CG presentation seemed to center mostly on the general question of how to handle the tradeoff between portability and platform-specific capabilities. Such capabilities could break compatibility with certain platforms, cause unexpected performance cliffs, require manually curating alternative code paths, or unknowingly introduce non-determinism or runtime traps when a module is imported as a sub-dependency that uses a non-portable feature.

I would like to come back to the suggestion on slide 11 of the presentation given at the in person CG meeting, to handle "Relaxed Instructions as Imports" since this to me seems to be the most desirable of those options.

As stated on the slide, the upside would be:

"Sidesteps introducing a new kind of non-determinism in WAsm": In other words, this moves the problem outside of the module and lets module consumers take the decision that is best for their use case, by providing the instruction as a normal function import that determines how the instruction is handled. The import could be fast but non-deterministic between architectures, it could force software emulation of one variant on architectures that do not support the instruction, or it could be an arbitrary function that applies custom logic or alternative code paths (if/else) matching the consumer's needs. This gives control to the consumer instead of forcing a one-size-fits-all choice that hurts either determinism or performance. In contrast to a simple macro-like if/else block, this decision could be taken locally for just one module, and it would put the module consumer in control of the implementation, who would otherwise be bound to the solution deemed most suitable by the module implementer.

If the instruction is not explicitly provided as an import, the host could either apply a globally set default that depends on the normal solution in this environment, or provide it as "undefined" so the module can detect that the capability does not exist. A module that cannot handle this would then "trap" predictably at compile or link time instead of hitting an undesirable runtime trap, or it could provide a default implementation itself. This solution seems to combine the upsides of the other suggestions, "drop consistency" and "explicit tests for arch-specifics", and, similar to WASI capabilities, limits the blast radius of the chosen tradeoff to the module. I believe this could also be a nice feature detection mechanism, by simply allowing an optional import for non-core Wasm features (which the host could provide by default if the consumer does not object or override).

Regarding the two downsides mentioned on slide 11:

1. **"The difference is mostly theoretical, moving non-determinism from Wasm into environment doesn’t change anything in practice"**: I assume this is in comparison to the explicit tests for arch-specifics (slide 12). Imho it is not mostly theoretical, as it is much more flexible and puts control into the hands of the consumer, who can pick a default or customize the behavior based on their platform and task requirements. "Explicit tests for arch-specifics", in contrast, would take control over how the instruction is implemented out of the consumer's hands and force them to answer a simple yes/no for all modules to enable this feature. What would be lost is the consumer's choice per module and the freedom to take application requirements into consideration by adapting the implementation, which the module implementer, without knowledge of the use case, naturally cannot do.
2. **"Mozilla suggested this does not fit well into SpiderMonkey"**: What specifically is the problem with this suggestion in relation to SpiderMonkey? Would this be resolved with "pre-imports" as proposed in @rossberg's Staged compilation and module linking presentation, by allowing the implementation to be provided at the time of module compilation as compared to instantiation?

The potential "downside" of platform-specific instructions as function imports could be that modules can no longer rely on these instructions behaving identically across modules, so they should be seen as (potentially hardware-accelerated) functions instead. But this stems from the nature of "relaxed" platform-specific instructions itself, so I believe it is a better tradeoff than introducing a one-size-fits-all solution globally and shutting out certain use cases completely or splitting the ecosystem.

Answer 1 · 2022-10-31T15:32:49.000Z

> 2. **"Mozilla suggested this does not fit well into SpiderMonkey"**: What specifically is the problem with this suggestion in relation to SpiderMonkey? Would this be resolved with "pre-imports" as proposed in @rossberg's Staged compilation and module linking presentation, by allowing the implementation to be provided at the time of module compilation as compared to instantiation?

It's not so much a problem with SpiderMonkey as it is with the general compilation model of WebAssembly so far. When we compile a module, we don't know anything about the import other than the function type. The import name is only used for linking, and cannot be used to signify that the import should have an alternate behavior.

What this means is that if you were to import one of these instructions as a function, we would need to compile it to essentially an indirect function call, as we don't know whether the special 'intrinsic function' (or whatever it would be called) is going to be provided as the import or something else (like a JS function). This would likely hurt the performance gains of the relaxed instructions pretty significantly.
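To make the point about opaque imports concrete, here is a minimal sketch using the synchronous JS API. The tiny hand-assembled module, the import name `env.madd`, and the two JavaScript stand-ins for a fused-like and an unfused-like multiply-add are all assumptions made up for this illustration; scalar f64 is used instead of v128 to keep the bytes short. The module is compiled once, before any import value exists, and the same compiled module is then bound to two different implementations.

```javascript
// Hand-assembled binary for an illustrative (assumed) module:
//   (import "env" "madd" (func (param f64 f64 f64) (result f64)))
//   (func (export "run") (param f64 f64 f64) (result f64)
//     local.get 0  local.get 1  local.get 2  call 0)
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic, version
  0x01, 0x08, 0x01, 0x60, 0x03, 0x7c, 0x7c, 0x7c, 0x01, 0x7c, // type section
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import section: "env"
  0x04, 0x6d, 0x61, 0x64, 0x64, 0x00, 0x00,                   //   "madd", func, type 0
  0x03, 0x02, 0x01, 0x00,                                     // function section
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x01,       // export "run" = func 1
  0x0a, 0x0c, 0x01, 0x0a, 0x00,                               // code section, one body
  0x20, 0x00, 0x20, 0x01, 0x20, 0x02, 0x10, 0x00, 0x0b,       // local.get x3, call 0, end
]);

// Compilation happens here, before any import value is available.
const wasmModule = new WebAssembly.Module(bytes);

// Reflection exposes the import's names and kind; the engine additionally
// knows the declared function type, but nothing about the function itself.
const importInfo = WebAssembly.Module.imports(wasmModule);

// Two stand-ins for different lowerings of a relaxed multiply-add.
const fusedLike = (a, b, c) => a * b + c;
const unfusedLike = (a, b, c) => Math.fround(a * b) + c; // rounds the product first

// The same compiled module accepts either one at instantiation time.
const instanceA = new WebAssembly.Instance(wasmModule, { env: { madd: fusedLike } });
const instanceB = new WebAssembly.Instance(wasmModule, { env: { madd: unfusedLike } });

console.log(importInfo);
console.log(instanceA.exports.run(0.1, 3, 0), instanceB.exports.run(0.1, 3, 0));
```

Because the binding is only chosen at the `WebAssembly.Instance` step, code generated at the `WebAssembly.Module` step cannot assume which function sits behind `call 0`.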

Something like "pre-imports" could solve this, but there would need to be an actual proposal to evaluate this to say this for sure. Additionally, the presentation Rossberg gave at the meeting relied on a version of the module linking proposal, which is not a quick feature to add to SpiderMonkey (I would guess other JS engines as well, but don't know for sure). So it could be some time to add it if this proposal required it.

Answer 2 · 2022-10-31T15:52:36.000Z

Thanks for clarifying, so it's indeed the dynamic nature of the imports not known at compile time.

> Something like "pre-imports" could solve this, but there would need to be an actual proposal to evaluate this to say this for sure. Additionally, the presentation Rossberg gave at the meeting relied on a version of the module linking proposal, which is not a quick feature to add to SpiderMonkey (I would guess other JS engines as well, but don't know for sure). So it could be some time to add it if this proposal required it.

Waiting for the module linking proposal to be revived and landed would probably block the progress of this proposal for too long. I imagine that what is needed to make this work is a way to import a function at compile time, which could be called a "compile time" or "static" import.

The simplest option that comes to mind for browsers, until the module linking proposal arrives, would be to extend compileStreaming with a second parameter, as in instantiateStreaming, and provide the import there. Other runtimes could do a similar thing. This might also help clarify the semantic difference between compileStreaming and instantiateStreaming since, from what I have heard, instantiateStreaming is sometimes where the code is actually compiled in some engines.

Answer 3 · 2022-10-31T16:00:05.000Z

I imagine it could also be desirable for performance optimization reasons in general other than SIMD (direct function calls, inlining) to provide some imports at compile time.

Answer 4 · 2022-10-31T18:27:19.000Z

I feel there are some misunderstandings as to what this suggestion entailed. The suggestion was to define Relaxed SIMD instructions as "imported" functions. However, these functions can't really be imported: as @eqrion explained above, this would require the WAsm engine to emit a call instruction for each use of the Relaxed SIMD operations, as it doesn't know the definition of these operations at the time it generates machine code. So instead, the WAsm engines would have to treat these "imports" as built-in operations, i.e. detect a call to a function named e.g. "wams_relaxed_madd_f32x4" and emit the Relaxed Multiply-Add operation without a function call. Thus, it sidesteps defining Relaxed SIMD operations in the WAsm bytecode format, but instead makes them a part of the environment provided by the WAsm engine. This is where the major drawback comes from: "The difference is mostly theoretical", as we've removed these non-deterministic operations from WAsm bytecode only to make them a required part of the environment this bytecode depends on.

Answer 5 · 2022-10-31T19:06:15.000Z

@Maratyszcza That's the issue that "pre-imports", or "staged compilation" is proposed to solve, by having a compilation stage where some imports are resolved in a separate stage, before the main compilation stage begins.

Answer 6 · 2022-10-31T19:11:39.000Z

IIUC, pre-imports doesn't imply that these functions are inlined, and in this use-case inlining is critical.

Answer 7 · 2022-10-31T19:14:55.000Z

Pre-imports would imply that they can be inlined, without changing the rest of the compilation model of Wasm.

Answer 8 · 2022-10-31T20:03:15.000Z

Could compiler engineers working on the runtimes comment on whether a "static function import" compilation stage would unlock inlining of imported functions, and whether it would be feasible to add to existing runtimes?

Could this compilation stage be polyfilled by a preprocessing tool in the style of wasm-ld, taking multiple modules as inputs and producing a single module as output, so this can be done without runtime support in the meantime?

Answer 9 · 2022-10-31T20:41:57.000Z

> Pre-imports would imply that they can be inlined, without changing the rest of the compilation model of Wasm.

For pre-imports to have acceptable performance they have to be inlined, and it will have immense cost in runtimes that don't inline. Implementing relaxed SIMD via pre-imports would require runtimes that want to support it to implement function inlining first. This is in addition to "non-determinism moving around" problem which @Maratyszcza explained so well above.

Answer 10 · 2022-11-01T03:40:46.000Z

> For pre-imports to have acceptable performance they have to be inlined, and it will have immense cost in runtimes that don't inline. Implementing relaxed SIMD via pre-imports would require runtimes that want to support it to implement function inlining first.

Thanks for these insights. Not being a compiler engineer, I have only a limited understanding of how difficult it would be to implement function inlining or macro expansion for compile-time static function imports, but I do understand this would require something like inlining or macros.

What would an MVP of a suitable mechanism, like function inlining or macro expansion of definitions provided at compile time, look like in terms of complexity, and is there interest in this from compiler engineers? Is there a prior discussion or proposal already?

Answer 11 · 2022-11-02T16:27:25.000Z

Trying to touch on a couple of the different discussion points here.

1. Normal function imports are not useful here because names are not significant and the imported function is not known at compile time, resulting in indirect function calls being emitted. This is why something 'new' would be required, like @Maratyszcza mentioned.
2. Pre-imports or some form of staged compilation would make it so the imported function is known at compile time. This would allow (but not require) an imported function to be inlined or handled in some special way.
3. Engines could make a (non-specified) guarantee that whenever a relaxed SIMD intrinsic function (e.g. WebAssembly.relaxedMulAdd) is pre-imported, we will inline it and it will be as fast as an instruction.
4. The most general form of pre-imports or staged compilation based on module linking is likely a lot of work for engines. It's not clear what an MVP proposal in this space would look like; I don't think anyone has proposed one. My guess would be finding a way to limit the pre-imports to just functions with certain simple params and result types.
5. A form of inlining certain intrinsic functions that are known at compile-time 'somehow' likely wouldn't be too hard to implement in SpiderMonkey.
6. The meta point of 'this just moves the non-determinism around' from @Maratyszcza is valid. It sounds like any solution here is going to have non-determinism somewhere, and we probably just need to find the place that makes the most people comfortable. I'm not sure if host provided intrinsic imports really is a large improvement for the downsides it introduces.

Answer 12 · 2022-11-02T17:20:39.000Z

I agree that getting "the one instruction" inlined into the final machine code is necessary to get the performance benefits. It's likely the cost of a function call would erase all the benefits from having the special instructions, so no solution that involves a function call in the normal case is viable.

There is a spectrum of possibilities to achieve the effect of emitting the "one right instruction", short of the module linking/pre-imports idea. Two ideas:

1. Defer compiling Wasm functions that call these special functions until instantiation time by being conservative when matching import names; i.e. if the import name looks like a relaxed SIMD function, don't compile (direct) callers of it until instantiation.
2. Speculatively match import names to relaxed SIMD functions at compile time and generate code assuming they will be bound properly. Verify that import bindings do indeed match speculation at instantiation time and recompile (to be indirect calls) if speculation was wrong.

Both of these trade some guarantees about exactly when compilation happens, but both get the same end result if the import bindings match properly. In the case of (1) it means that not all compilation can happen before imports are known. In the case of (2) it means that it can, speculatively, at the cost of backing out in case bindings are different.
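The control flow of the speculative option can be modelled in a few lines of plain JavaScript. This is only a sketch of the decision logic, not engine code: the `relaxed_simd/madd` name, the `INTRINSICS` table, and the returned strings are hypothetical, and import descriptors are passed in directly in the shape that `WebAssembly.Module.imports()` returns.

```javascript
// Stand-in for a host-provided intrinsic (hypothetical; real engines would
// hold an internal reference to the actual relaxed SIMD operation).
const relaxedMadd = (a, b, c) => a * b + c;

// Hypothetical table of import names an engine would treat as intrinsics.
const INTRINSICS = new Map([["relaxed_simd/madd", relaxedMadd]]);

// "Compile time": only names and kinds are known, so match on the names.
function speculate(importDescriptors) {
  return importDescriptors
    .filter((d) => d.kind === "function" && INTRINSICS.has(`${d.module}/${d.name}`))
    .map((d) => ({ module: d.module, name: d.name }));
}

// "Instantiation time": check that each speculated import is really bound to
// the intrinsic; otherwise the optimized code has to be thrown away.
function checkBindings(speculated, imports) {
  const allMatch = speculated.every(({ module, name }) => {
    const bound = imports[module] ? imports[module][name] : undefined;
    return bound === INTRINSICS.get(`${module}/${name}`);
  });
  return allMatch ? "keep-inlined-code" : "recompile-with-indirect-calls";
}

const descriptors = [
  { module: "relaxed_simd", name: "madd", kind: "function" },
  { module: "env", name: "log", kind: "function" },
];
const speculated = speculate(descriptors);

console.log(checkBindings(speculated, { relaxed_simd: { madd: relaxedMadd } }));
console.log(checkBindings(speculated, { relaxed_simd: { madd: () => 0 } }));
```

In this model all extra cost lands on the mismatch path, which is the property the speculative option relies on.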

AFAICT both of these options require no spec changes, no new APIs, and no new Wasm mechanisms. In particular, with instantiateStreaming(), the imports are actually known at compile time. We could even recommend that modules that want relaxed SIMD should prefer to be instantiated through streaming or through the instantiate() overload that takes the raw module bytes.

In general I want to solve the staging problem for Wasm, and do it right, but also do not want to either rush that proposal or block this proposal. So my current best guess is that speculative compilation can work as a holdover and does not commit us to any particular choice here.

It's also worth noting that V8 does a little bit of trickery to make imports of Math.* faster, bypassing a roundtrip through the JS stubs and going directly to C stubs. It's not quite on this level, but worth mentioning.

Another thing worth noting is that we could approximate staging at the JS API level by allowing binding only a subset of imports, with the result being a module with residual imports, rather than an instance.
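As a rough illustration of what "a module with residual imports" could look like from the JS side, here is a userland sketch. No such API exists today: `bindSome`, the `residual` list, and the two import names (`simd.madd`, `env.bias`) are invented for this example, and a userland wrapper like this gives the engine no extra information for inlining. It only shows the shape of binding some imports early and the rest later.

```javascript
// Hand-assembled binary for an illustrative (assumed) module:
//   (import "simd" "madd" (func (param f64 f64 f64) (result f64)))
//   (import "env"  "bias" (func (result f64)))
//   (func (export "run") (param f64 f64) (result f64)
//     local.get 0  local.get 1  call 1  call 0)
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic, version
  0x01, 0x12, 0x03,                                     // type section, 3 types
  0x60, 0x03, 0x7c, 0x7c, 0x7c, 0x01, 0x7c,             //   type 0: (f64 f64 f64) -> f64
  0x60, 0x00, 0x01, 0x7c,                               //   type 1: () -> f64
  0x60, 0x02, 0x7c, 0x7c, 0x01, 0x7c,                   //   type 2: (f64 f64) -> f64
  0x02, 0x18, 0x02,                                     // import section, 2 imports
  0x04, 0x73, 0x69, 0x6d, 0x64, 0x04, 0x6d, 0x61, 0x64, 0x64, 0x00, 0x00, // simd.madd
  0x03, 0x65, 0x6e, 0x76, 0x04, 0x62, 0x69, 0x61, 0x73, 0x00, 0x01,       // env.bias
  0x03, 0x02, 0x01, 0x02,                               // function section: type 2
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x02, // export "run" = func 2
  0x0a, 0x0c, 0x01, 0x0a, 0x00,                         // code section, one body
  0x20, 0x00, 0x20, 0x01, 0x10, 0x01, 0x10, 0x00, 0x0b, // get 0, get 1, call bias, call madd
]);

// Hypothetical helper: bind a subset of imports now, report what is left.
function bindSome(wasmModule, early) {
  const residual = WebAssembly.Module.imports(wasmModule).filter(
    (imp) => !(early[imp.module] && imp.name in early[imp.module])
  );
  return {
    residual,
    // Later stage: supply the remaining imports; early bindings take precedence.
    instantiate(late = {}) {
      const merged = {};
      for (const ns of new Set([...Object.keys(early), ...Object.keys(late)])) {
        merged[ns] = { ...(late[ns] || {}), ...(early[ns] || {}) };
      }
      return new WebAssembly.Instance(wasmModule, merged);
    },
  };
}

const wasmModule = new WebAssembly.Module(bytes);
const staged = bindSome(wasmModule, { simd: { madd: (a, b, c) => a * b + c } });
console.log(staged.residual.map((imp) => `${imp.module}.${imp.name}`));

const instance = staged.instantiate({ env: { bias: () => 4 } });
console.log(instance.exports.run(2, 3));
```

An engine-level version of the first stage is where inlining of the early-bound functions could happen; the wrapper above only mimics the two-step interface.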

Answer 13 · 2022-11-02T17:22:40.000Z

@eqrion: Agree on your assessments on 1-4.

> A form of inlining certain intrinsic functions that are known at compile-time 'somehow' likely wouldn't be too hard to implement in SpiderMonkey.

This would be my hope, as in the simplest case it could be something like a macro like "copy paste".

> The meta point of 'this just moves the non-determinism around' from @Maratyszcza is valid. It sounds like any solution here is going to have non-determinism somewhere, and we probably just need to find the place that makes the most people comfortable.

I agree it moves non-determinism around, but it allows the module consumer (recursively, down to the end user), who actually has knowledge of the use case, to pick the tradeoff, instead of forcing the implementer to take a decision oblivious of the use case. Also, the consumer can set a reasonable default strategy for their use case so modules behave consistently, or even override their default, and it would avoid unexpected runtime traps as this would be known at compile time.

This is not the same, as it puts the choice in the hands of the module consumer and retains predictable composability of modules, as opposed to forcing the implementer to take a decision in a situation where all options have downsides and their pick might be inadequate to the use case of the consumer, leading to unstable or inconsistent behavior and affecting composability of modules: one traps, one runs slow, one runs non-deterministically. This would affect compatibility and stability of modules; they would no longer be pluggable without reading the fine print of their leaflet.

> I'm not sure if host provided intrinsic imports really is a large improvement for the downsides it introduces.

IIUC by downsides (I only see one) do you mean the time and effort needed to spec and implement a way to guarantee inlining of compile-time static function imports or macros? I don't see any other downsides and if we could find a very simple MVP this single downside could shrink to a feasible size and would also be useful for other cases where inlining function calls or macro expansion could improve performance.

In summary, I see your concern regarding the implementation effort, effectively blocking this proposal on a proposal that does not yet exist, so it's probably reasonable to start with the if/else approach. However I could see this problem arising again in similar cases in the future and it might motivate to start working on such a proposal that allows very pragmatically to inline imported function calls or macros.

Answer 14 · 2022-11-02T17:34:46.000Z

> I agree it moves non-determinism around, but it allows the module consumer (recursively, down to the end user), who actually has knowledge of the use case, to pick the tradeoff, instead of forcing the implementer to take a decision oblivious of the use case. Also, the consumer can set a reasonable default strategy for their use case so modules behave consistently, or even override their default, and it would avoid unexpected runtime traps as this would be known at compile time.
>
> This is not the same, as it puts the choice in the hands of the module consumer and retains predictable composability of modules, as opposed to forcing the implementer to take a decision in a situation where all options have downsides and their pick might be inadequate to the use case of the consumer ...

I really like this framing. Thanks for articulating it that way.

Answer 15 · 2022-11-02T17:48:42.000Z

It is worth adding, before anyone mentions partial function recompilation or code patching, that the current implementation works closely with the compiler's register allocator: to fit registers into instruction limitations (e.g. PBLENDVB, needed for laneselect, requires XMM0 to be available), and to reserve temporary registers.

> Two ideas: 1. Defer compiling Wasm functions...., 2... and recompile (to be indirect calls) if speculation was wrong...

Both ideas are really bad for compiled code caching.

Answer 16 · 2022-11-02T17:51:38.000Z

> I agree it moves non-determinism around, but it allows the module consumer (recursively, down to the end user), who actually has knowledge of the use case, to pick the tradeoff, instead of forcing the implementer to take a decision oblivious of the use case. Also, the consumer can set a reasonable default strategy for their use case so modules behave consistently, or even override their default, and it would avoid unexpected runtime traps as this would be known at compile time.

I am not sure what this means; we are talking about generating calls to imported functions (with inlining guarantees) instead of instructions. By "user" do you mean the runtime or the developer writing the module or site using it? And what is "knowledge of the use case" - the rest of the module, platform it is running on, etc? And how do those options differ from the current approach?

Answer 17 · 2022-11-02T18:01:02.000Z

> It is worth adding, before anyone mentions partial function recompilation or code patching, that the current implementation works closely with the compiler's register allocator: to fit registers into instruction limitations (e.g. PBLENDVB, needed for laneselect, requires XMM0 to be available), and to reserve temporary registers.

Sure, the recompilation unit would be the method, and the intrinsification would happen early in parsing the Wasm bytecode (or soon after), so that the node in the graph is indeed an intrinsic and makes its way through the compilation pipeline just as if it had originally been an actual bytecode.

> Two ideas: 1. Defer compiling Wasm functions...., 2... and recompile (to be indirect calls) if speculation was wrong...
>
> Both ideas are really bad for compiled code caching.

In 1, the browser could cache the unspecialized module, thus only generating code for the likely small number of functions that called those speculative imports. In 2, it could cache either the speculatively optimized one or both.

With option 2, all the costs (not compiling early enough, not caching the inlined code, etc) would be pushed to the mismatch case.

Answer 18 · 2022-11-02T18:08:23.000Z

> I am not sure what this means; we are talking about generating calls to imported functions (with inlining guarantees) instead of instructions. By "user" do you mean the runtime or the developer writing the module or site using it?

I use these terms to illustrate, but these concepts are more general in the case of Wasm.

An end user I would say sits at the end of the chain of consumers, for example in one scenario would be a person actually invoking the final program, e.g. say it's a CLI tool it's the person invoking the module in a wasm runtime with certain parameters.

By module consumer I mean another module or process importing (consuming) this module (in our case, one containing SIMD instructions). If there are several levels, this forms a chain of direct and indirect consumers up to the final end user invoking the "program".

> And what is "knowledge of the use case" - the rest of the module, platform it is running on, etc? And how do those options differ from the current approach?

Knowledge of the use case is gradual, every layer of imports adds more information about it. At the level of the end user one might or might not have very specific knowledge about the platform, program and problem, at a higher level you might just know more about the outer module that uses the module containing SIMD instructions.

Answer 19 · 2022-11-02T18:37:55.000Z

Do you expect the consumer module to be able to affect SIMD intrinsic calls in the module it consumes? This definitely would make intrinsics more attractive, though I am not sure what the mechanics of that would be.

Answer 20 · 2022-11-02T18:41:49.000Z

I missed this part:

> And how do those options differ from the current approach?

I think in the current approach (I assume this is the if/else case) the module would detect if the platform supports this instruction or not - unclear to me how that feature detection works, could be a built-in check or a flag also imported from the outside.

If feature detection works by some built-in mechanism the module consumer might have no say over whether this particular module is allowed to use relaxed SIMD or not and no way to choose how fallback is polyfilled (non determinism, trap, slow path). Maybe the runtime provides a global option whether relaxed SIMD is on or off, the consumer might or might not have control over this depending on their role.

With an import flag (which in this scenario should be standardized) at least the consumer could say disable relaxed SIMD for this module. With the import flag I wonder if we would have the same problem that the if/else branch cannot be optimized away if this is not known at compile time. This will be less costly than a function call I suppose, but would still introduce an overhead if the engine is not able to optimize it away at a later stage.

In any case, even if the consumer is allowed to make a binary yes/no choice (globally or locally) on whether SIMD is used, they have no say at all in which fallback strategy is used, nor would the consumer be able to make modules agree on a strategy. That is the main difference as I see it.

> Do you expect the consumer module to be able to affect SIMD intrinsic calls in the module it consumes? This definitely would make intrinsics more attractive, though I am not sure what the mechanics of that would be.

Yes. The consumer would have total control over what calling a relaxed SIMD instruction would mean, it could just inline the pure virtual instruction (which is an overloaded term, not intended to use the C++ meaning) or put an arbitrary whole block of code there (maybe a fallback or a switch).

Answer 21 · 2022-11-02T20:14:27.000Z

> There is a spectrum of possibilities to achieve the effect of emitting the "one right instruction", short of the module linking/pre-imports idea.

I like the pragmatic idea of hinting at the need for inlining by the name of a relaxed SIMD instruction as an import name to unblock this proposal moving forward before generalizing the approach. From this it would not be far to an "inline" keyword in front of the imported function name to make it generic.

> AFAICT both of these options require no spec changes, no new APIs, and no new Wasm mechanisms. In particular, with instantiateStreaming(), the imports are actually known at compile time. We could even recommend that modules that want relaxed SIMD should prefer to be instantiated through streaming or through the instantiate() overload that takes the raw module bytes.
>
> Another thing worth noting is that we could approximate staging at the JS API level by allowing binding only a subset of imports, with the result being a module with residual imports, rather than an instance.

Sounds like a simple and effective idea - in part what I meant when proposing a second parameter for compileStreaming for some of the imports, but if for an intermediate solution we would get this sooner or at all before elaborating on the approach, this is all the more welcome.

Answer 22 · 2022-11-02T22:39:43.000Z

I think if we were to evaluate doing "a single native instruction as a function import" as proposed here we should also consider the next step: "multiple native instructions as a function import." What I mean is: given the difficulty to design a mechanism for inlining certain function calls to a single instruction and convincing all of the engine teams to maintain native lowering code for these special intrinsic imports, it might make more sense to just expose entire blocks of functionality as function imports instead. E.g., in the XNNPACK case, kernels (read: instruction sequences) are composed together to do something interesting. If interested engines were to instead maintain a set of kernels, then presumably an ML application could call (normal calls, not inlined) a series of these imported functions. In this scheme, there is no need for an intrinsic inlining mechanism and kernels could use the full capabilities of the host system (full vector widths not necessarily limited to v128, e.g.); there would be less friction about introducing new instructions to WebAssembly and non-determinism would be the host/engine responsibility.

I would propose that this scheme, "multiple native instructions as a function import," is essentially what WASI is today. For the XNNPACK example, wasi-nn has considered in the past how it could support separate kernels like I described above and perhaps that could be explored further. WebNN is doing something similar here and perhaps there is some overlap. What problems do I see with this approach? 1) WASI is mainly a standalone engine thing these days, though I am hopeful that one day WASI programs could be run in both standalone and browser environments, and 2) we would need a mechanism for optional function imports, so that modules could detect engines that do not support the special kernel imports and run an unoptimized version of the kernel in WebAssembly itself.
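Core Wasm currently requires every declared import to be supplied at instantiation, so the "optional function import" idea can today only be approximated in the embedder's glue code. The sketch below shows that approximation under stated assumptions: `resolveOptionalImports`, the `dot` kernel name, and the shape of the host kernel table are all hypothetical, and the fallback here is plain JavaScript standing in for an unoptimized kernel that a module would otherwise carry in Wasm itself.

```javascript
// Portable fallback kernel (stand-in for an unoptimized in-module version).
function dotFallback(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Hypothetical glue: use a host kernel when the host offers one,
// otherwise fall back, and report which kernels were missing.
function resolveOptionalImports(hostKernels, fallbacks) {
  const resolved = {};
  const missing = [];
  for (const [name, fallback] of Object.entries(fallbacks)) {
    if (typeof hostKernels[name] === "function") {
      resolved[name] = hostKernels[name];
    } else {
      resolved[name] = fallback;
      missing.push(name);
    }
  }
  return { resolved, missing };
}

// A host that offers no kernels, and one that offers an (assumed) native dot.
const withoutHost = resolveOptionalImports({}, { dot: dotFallback });
const withHost = resolveOptionalImports(
  { dot: (a, b) => dotFallback(a, b) }, // placeholder for a host-accelerated kernel
  { dot: dotFallback }
);

console.log(withoutHost.missing, withoutHost.resolved.dot([1, 2, 3], [4, 5, 6]));
console.log(withHost.missing, withHost.resolved.dot([1, 2, 3], [4, 5, 6]));
```

The `resolved` object could then be passed as an import namespace, so the module sees the same function signature whether or not the host kernel exists.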

This "WASI" talk may be a bit off topic for this issue but I thought it was a different perspective worth thinking about. I also want to see WebAssembly code using the best a host system can offer but maybe that could be at a level higher than a single instruction?

Answer 23 · 2022-11-02T22:53:26.000Z

> I think if we were to evaluate doing "a single native instruction as a function import" as proposed here we should also consider the next step: "multiple native instructions as a function import."

Fully agree, as the earlier issue renaming back and forth and mention of fallbacks suggested, if that is a similar low hanging fruit this would definitely be preferable. I assume it should be the compiler engineers to agree on what is feasible as a MVP.

My main motivation is not SIMD per se but to avoid Wasm being at a systemic disadvantage compared to native code, which makes adopting it more of an investment risk and limits adoption. While the core common denominator should be generic and kept intact, a way to gradually extend capabilities in the spirit of WASI would be powerful and could potentially put Wasm on par with or ahead of native code imho.

Answer 24 · 2022-11-02T23:34:27.000Z

A lot has been said about consumer control over lowering of Relaxed SIMD instructions, and I would like to see a use-case where this is useful. To be specific, let's consider several scenarios:

1. WAsm module is part of open-source software (e.g. TensorFlow.js + XNNPack), consumer wants maximum performance -> consumer builds the WAsm module with -mrelaxed-simd.
2. WAsm module is part of open-source software (e.g. TensorFlow.js + XNNPack), consumer wants maximum determinism -> consumer builds the WAsm module without -mrelaxed-simd.
3. WAsm module is part of closed-source software. Consumer can't change whether Relaxed SIMD is used in the WAsm module because they can't recompile it, and also can't change how Relaxed SIMD instructions are imported because they don't have the source for the JavaScript component that imports the WAsm module.

In all situations, there is no difference in Relaxed SIMD operations being WAsm opcodes or imported functions: if the consumer has the source code, they can modify it to avoid Relaxed SIMD, and if they don't, they can't modify how it is loaded either.

Answer 25 · 2022-11-03T02:04:50.000Z

> A lot has been said about consumer control over lowering of Relaxed SIMD instructions, and I would like to see a use-case where this is useful. To be specific, let's consider several scenarios:
>
> 1. WAsm module is part of open-source software (e.g. TensorFlow.js + XNNPack), consumer wants maximum performance -> consumer builds the WAsm module with -mrelaxed-simd.
> 2. WAsm module is part of open-source software (e.g. TensorFlow.js + XNNPack), consumer wants maximum determinism -> consumer builds the WAsm module without -mrelaxed-simd.

I see what you mean, since this will get you quite far and is familiar for native developers. IIUC your example assumes two versions of the .wasm module, one with and one without relaxed SIMD. Unclear if this would be compiled from its source language or if it would already be a .wasm module importing other .wasm modules. In a future with precompiled components written in different languages pulled from a registry, we must assume the latter a binary .wasm module as the input format to compilation as the format of interchange and linking. If the imported module contains relaxed SIMD instructions, the consuming module must somehow be able to check if these are allowed and abort compilation or linking if they are forbidden and you want to prevent a runtime trap.

In an unsupported environment, where the instruction / capability is not supported by the CPU or the business logic requires determinism to be enforced, importing a module with forbidden instructions must fail during compilation or linking. In practice, a runtime would probably have SIMD disabled globally by default in this case, and compiling or linking modules containing forbidden instructions would fail. In a more fine-grained approach, one could set this flag during import or pre-compilation of the module, so that in the same program the blockchain or smart contract could be deterministic while the 3D graphics module runs fuzzy but fast, for example. In part this is already what is proposed, minus the replaceable implementation.
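To make the fine-grained variant concrete, here is a minimal sketch of a link-time check. It is only an illustration: `ModuleInfo`, `Policy`, `linkModule` and the feature names are hypothetical and do not correspond to any existing runtime API. It assumes the runtime already knows which non-portable features a module uses after decoding it.

```typescript
// Hypothetical sketch: none of these names come from a real runtime API.
type Feature = "relaxed-simd" | "threads";

interface ModuleInfo {
  moduleName: string;
  // Non-portable features used by the module (assumed known after decoding).
  usedFeatures: Feature[];
}

interface Policy {
  allowed: Feature[];
}

// Assumed runtime-wide default: deterministic, nothing non-portable allowed.
const defaultPolicy: Policy = { allowed: [] };

// Rejects a module at link time instead of letting it trap at runtime.
function linkModule(mod: ModuleInfo, policy: Policy = defaultPolicy): string {
  const forbidden = mod.usedFeatures.filter(
    (f) => policy.allowed.indexOf(f) === -1
  );
  if (forbidden.length > 0) {
    throw new Error(
      `link error in ${mod.moduleName}: forbidden ${forbidden.join(", ")}`
    );
  }
  return `${mod.moduleName} linked`;
}

const contract: ModuleInfo = { moduleName: "contract", usedFeatures: [] };
const graphics: ModuleInfo = {
  moduleName: "graphics",
  usedFeatures: ["relaxed-simd"],
};

// Per-module override: only the graphics module trades determinism for speed.
const fastPolicy: Policy = { allowed: ["relaxed-simd"] };

console.log(linkModule(contract));
console.log(linkModule(graphics, fastPolicy));
```

Linking `graphics` under the default policy throws before any code runs, which is the predictable failure mode argued for above.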

In a supported environment, where the instruction / capability is supported and the business logic does not require determinism, high and predictable performance will often be mission critical, at the expense of precision. This means the results will be non-deterministic but fast on all platforms that support the instruction, while the module will not run at all on platforms that lack it, or will run slowly if there is a fallback.

This case is more subtle. If you want to be both fast and portable, you will want to take advantage of all advanced capabilities, but you need to implement a fallback strategy (as is common in 3D engines), so in this example you would have to ship two versions of your module separately. Recursively, a fallback version would require the consumer to write two versions of their module as well. For a single flag this might still be feasible, but with n such flags you would have to compile 2^n module variants to cover the different sets of non-portable instructions or capabilities, so a simple compiler flag at source level makes things hard. The code paths should rather be built into the wasm module, or outsourced and injected into it, and chosen when compiling it or linking it with other wasm modules. So I think the downside of this approach is that it forces a combinatorial explosion of wasm module variants to be maintained, which trickles through the chain of consumer modules and is only feasible for a very small number of feature flags.
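The 2^n growth is easy to see by enumerating the variants. The following sketch lists every on/off combination for a set of flags; the flag names other than relaxed-simd and the file naming scheme are made up for illustration.

```typescript
// Enumerates every on/off combination of the given feature flags.
function moduleVariants(flags: string[]): string[][] {
  const variants: string[][] = [];
  const total = 1 << flags.length; // 2^n combinations
  for (let mask = 0; mask < total; mask++) {
    // Bit i of the mask decides whether flag i is enabled in this variant.
    variants.push(flags.filter((_, i) => (mask & (1 << i)) !== 0));
  }
  return variants;
}

// Hypothetical flag set; only "relaxed-simd" appears in this discussion.
const featureFlags = ["relaxed-simd", "threads", "fp16"];

for (const v of moduleVariants(featureFlags)) {
  console.log(v.length > 0 ? `x.${v.join(".")}.wasm` : "x.wasm");
}
```

Three flags already give eight binaries to build, test and publish, and each consumer that wants its own fallbacks multiplies that again.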

In a less complex scenario, native modules on NPM, you can observe the pain points that arise when a precompiled binary is not provided for a certain combination of OS and architecture: the compilation from source that this triggers reveals the fragility of the native toolchains underneath, and the whole project refuses to compile. If there is going to be a Wasm component registry, like NPM in the JavaScript world, people will hopefully no longer have to invest time in regularly compiling modules from source, as it is time consuming. So there would be a module x.wasm and a module x.relaxed-simd.wasm, which the package registry is hopefully able to pull, as NPM currently does for different CPU architectures and OS variants. That is a real pain point in practice in those cases where a popular module like node-sass (which for that reason is no longer supported and has been replaced by a rewrite in a scripting language) does not provide a binary and fails to build from source.

One could say this compiler flag could operate on .wasm modules instead of on source code, picking the right branch or nested module (fast or fallback), which would be a step better, since two separate .wasm files would no longer have to be shipped. This would solve most of the issue but would still force the developer to implement fallback strategies matching the different platform capabilities and constraints. Those change over time, which makes maintaining the strategies difficult, if the fallback strategy even matches what the consumer needs for its use case.

However, as @abrown mentioned, maybe instead of single instructions you would rather want to import a hardware-accelerated kernel. A kernel is not a single instruction and might be represented by different sets of instructions on different architectures, so you could not really pin down a single instruction to lower it to. And if an older module uses a kernel that is now supported by hardware-accelerated instructions, all modules using a common, highly optimized imported kernel could be updated to use the new instructions just by updating the import. Given it's sliced right, this would make maintaining and upgrading code for new hardware capabilities much easier: it leaves the highly specialized code encapsulated in a common import that experts can maintain, makes SIMD more accessible, and simplifies the maintenance and migration of existing code to new platforms and capabilities, for experts as well. This would also reduce the burden on module implementers of writing a custom fallback strategy themselves, as they could rely on an import that polyfills either the single instruction they need or the whole kernel, which would make newer SIMD instructions easier to adopt.
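A rough model of the kernel-as-import idea, with the module reduced to a plain function of its imports. Everything here is hypothetical (`DotKernel`, `instantiateModule`, both kernels); the second kernel is only a stand-in for a hardware-accelerated implementation and uses a different summation order, which is the kind of difference that can change low-order bits of floating-point results.

```typescript
// Hypothetical sketch: a "module" modelled as a function of its imports.
type DotKernel = (a: number[], b: number[]) => number;

interface KernelImports {
  dot: DotKernel;
}

interface ModuleExports {
  cosineSimilarity(a: number[], b: number[]): number;
}

// The module only relies on the kernel's signature and semantics,
// not on how the consumer chooses to implement it.
function instantiateModule(imports: KernelImports): ModuleExports {
  return {
    cosineSimilarity(a, b) {
      const norm = Math.sqrt(imports.dot(a, a) * imports.dot(b, b));
      return imports.dot(a, b) / norm;
    },
  };
}

// Portable fallback: sequential scalar sum, same order everywhere.
const scalarDot: DotKernel = (a, b) => {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
};

// Stand-in for an accelerated kernel: four partial sums, like SIMD lanes.
const fourLaneDot: DotKernel = (a, b) => {
  const lanes = [0, 0, 0, 0];
  for (let i = 0; i < a.length; i++) lanes[i % 4] += a[i] * b[i];
  return lanes[0] + lanes[1] + lanes[2] + lanes[3];
};

// The consumer decides which kernel the same module gets.
const portable = instantiateModule({ dot: scalarDot });
const accelerated = instantiateModule({ dot: fourLaneDot });

console.log(portable.cosineSimilarity([1, 2, 3], [3, 2, 1]));
console.log(accelerated.cosineSimilarity([1, 2, 3], [3, 2, 1]));
```

Upgrading to a newer kernel then means swapping the object passed to `instantiateModule`, without touching the module itself.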

WAsm module is part of closed-source software. Consumer can't change whether Relaxed SIMD is used in the WAsm module because they can't recompile it, and also can't change how Relaxed SIMD instructions are imported because they don't have the source for the JavaScript component that imports the WAsm module.

If it is a Wasm module, then it has been compiled from source or cross-compiled from native code, so a code migration tool in the style of wasm-ld could operate on the wasm bytecode, search for instances of the SIMD opcodes, and expose them as imports.
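What such a tool would do can be sketched on a simplified, already decoded instruction list. This is not a real binary rewriter: `Instr`, `FuncBody` and `exposeAsImports` are hypothetical, call targets are symbolic names rather than function indices (so no index renumbering is modelled), and relaxed SIMD instructions are recognised simply by the `.relaxed_` part of their text-format name.

```typescript
// Hypothetical decoded form; a real tool would parse the binary format.
interface Instr {
  op: string;
  target?: string; // symbolic callee name, only used by "call"
}

interface FuncBody {
  funcName: string;
  body: Instr[];
}

interface RewriteResult {
  imports: string[]; // instruction names that became function imports
  funcs: FuncBody[];
}

// Replaces each relaxed SIMD instruction with a call to an import of the
// same name. A call consumes the same stack operands as the instruction,
// provided the import is declared with a matching signature.
function exposeAsImports(funcs: FuncBody[]): RewriteResult {
  const imports: string[] = [];
  const rewritten = funcs.map((f) => ({
    funcName: f.funcName,
    body: f.body.map((ins): Instr => {
      if (ins.op.indexOf(".relaxed_") === -1) return ins;
      if (imports.indexOf(ins.op) === -1) imports.push(ins.op);
      return { op: "call", target: ins.op };
    }),
  }));
  return { imports, funcs: rewritten };
}

const original: FuncBody[] = [
  {
    funcName: "axpy",
    body: [
      { op: "local.get" },
      { op: "local.get" },
      { op: "local.get" },
      { op: "f32x4.relaxed_madd" },
    ],
  },
];

const migrated = exposeAsImports(original);
console.log(migrated.imports);
console.log(migrated.funcs[0].body);
```

The rewritten module no longer contains the instruction itself, so whoever instantiates it decides whether the import is backed by the native instruction, a deterministic emulation, or something else.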

In all situations, there is no difference in Relaxed SIMD operations being WAsm opcodes or imported functions: if the consumer has the source code, they can modify it to avoid Relaxed SIMD, and if they don't, they can't modify how it is loaded either.

Modifying source code would turn an import configuration into a code migration task at source level, which the module consumer might not be able to take on due to time constraints or limited knowledge of the language, toolchain, and project. Even if this falls within familiar territory and is within reach, it will still turn minutes into potentially days, weeks, or months of understanding, adapting, and recompiling the source code from scratch.

For me, one of the value propositions of Wasm is that at some point you might not have to spend a week building a project or dependency from source and could still run code on all platforms, progressively taking advantage of their capabilities. Even if that is not too popular on the native side yet, it is one of the reasons scripting languages are so popular. While these have their downsides as well, Wasm could pick the best of both worlds.

Answer 26 · 2022-11-09T13:31:44.000Z

Here is a proposal for a clean variant of SIMD with compiler flags, using profiles, which in combination with a separate proposal for inlining calls to imported functions would allow a future of injecting inlined platform-specific instruction blocks without the downsides of shipping separate module variants described above:

In a future as described in the profiles proposal, I imagine the runtime will have an overridable default that determines which platform-specific instructions are allowed and how they should behave. This profile should be visible in the module, maybe as standardized named constants for feature detection. The runtime's compile/instantiate function would use this profile as a default, but it could be overridden in a specific call of the compile/instantiate function if not all modules have the same requirements.

A .wasm module compiled with a specific profile would replace the need for a compiler flag at source level in the native toolchain. The wasm compiler can thus detect on the wasm side whether the feature is present, and select and optimize the platform-specific code path. The result would be a module compiled with or without relaxed SIMD instructions, but from a wasm binary instead of from source code. The .wasm binary compiled from source code contains both code paths, which avoids the need to ship two module binaries, as the selection is a simple macro-like if/else. With this, in the scenario @Maratyszcza described, which has a relaxed SIMD compiler flag, the flag would be part of a wasm runtime profile and of a wasm-only toolchain.
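The macro-like if/else can be sketched as a compile step that resolves feature tests against a profile and keeps only the chosen branch. This is a toy model under stated assumptions: `Profile`, `CodeNode`, `specialize` and `compileWithProfile` are hypothetical names, and the profiles proposal does not define this representation.

```typescript
// Hypothetical sketch of profile-driven branch selection at compile time.
interface Profile {
  relaxedSimd: boolean;
}

type CodeNode =
  | { kind: "op"; opName: string }
  | { kind: "ifFeature"; feature: keyof Profile; fast: CodeNode[]; fallback: CodeNode[] };

// Resolves every feature test against the profile; the branch that is
// not taken never reaches the compiled output.
function specialize(code: CodeNode[], profile: Profile): string[] {
  const out: string[] = [];
  for (const node of code) {
    if (node.kind === "op") {
      out.push(node.opName);
    } else {
      const branch = profile[node.feature] ? node.fast : node.fallback;
      out.push(...specialize(branch, profile));
    }
  }
  return out;
}

// Assumed runtime-wide default, overridable per compile call.
const runtimeDefault: Profile = { relaxedSimd: false };

function compileWithProfile(code: CodeNode[], override: Partial<Profile> = {}): string[] {
  return specialize(code, { ...runtimeDefault, ...override });
}

// One binary carrying both code paths.
const kernelCode: CodeNode[] = [
  { kind: "op", opName: "v128.load" },
  {
    kind: "ifFeature",
    feature: "relaxedSimd",
    fast: [{ kind: "op", opName: "f32x4.relaxed_madd" }],
    fallback: [
      { kind: "op", opName: "f32x4.mul" },
      { kind: "op", opName: "f32x4.add" },
    ],
  },
  { kind: "op", opName: "v128.store" },
];

console.log(compileWithProfile(kernelCode));
console.log(compileWithProfile(kernelCode, { relaxedSimd: true }));
```

The same input yields the deterministic path under the default profile and the relaxed path when a single compile call overrides it, so only one module binary has to be shipped.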

The proposal of injecting platform-specific instructions could be separated and generalized into a proposal for a keyword that guarantees the inlining of imported functions. If we allow importing functions that are guaranteed to be inlined, this could be all we need to import either a single built-in platform-specific instruction or a generic function containing a block of these instructions, like a computing kernel.

As established, the function import is provided by the consumer, which is another module or the host runtime. This moves the problem one or more levels up the chain of consumers, so the question remains: where do the platform-specific instructions originate? I think they would originate from a module compiled with a profile that allows them, as described above. Now if you have a module that contains relaxed SIMD instructions and exports a kernel function, this function could be provided as a regular import to a module that is agnostic of the platform-specific feature (like SIMD) and just needs a fast function with the specified semantics. If one wanted to provide a polyfill for a module that only has a SIMD path, one could use a code migration tool as described above that rewrites the module to import the platform-specific instruction as a regular import, so that would not be ruled out.

I don't really see a downside to my earlier proposal if the compiler flags are implemented with profiles as described above, other than that module authors might not take the opportunity to write platform-agnostic modules that import platform-specific instructions. In that case, a code migration tool might help to patch those modules on the wasm side so they're at least usable. I would be interested in feedback on this proposal, and in whether there is interest in creating a separate proposal for a keyword that guarantees inlining of calls to specific imported functions. That would unlock the scenario of injecting platform-specific code into platform-agnostic modules. It would probably also be interesting for a lot of other cases where the lack of inlining introduces significant overhead, as described in @RossTate's presentation for example, and in general for calls across modules. Those become more relevant the smaller the modules become, which would be the trend if we want an ecosystem of reusable, composable modules or components that are lightweight and low-overhead.