Benchmarks
Opened this issue · 54 comments
Hi @yamt ,
it is really cool that you put so many Wasm runtimes on your benchmarks for comparison!
I have a few questions though.
What hardware did you run the benchmarks on? Would be cool if you could write that down somewhere for reproducibility.
Also I saw that `wasmi` is included in the script but not showing in the `README.md`. Did `wasmi` not work? If it works I'd be interested in the numbers on your machine. :)
Unfortunately running the benchmark script requires quite a setup.
> Hi @yamt ,
> it is really cool that you put so many Wasm runtimes on your benchmarks for comparison!
thank you.
but honestly speaking i feel i added too many runtimes.
the purpose of the benchmark was to give a rough idea about toywasm performance.
a few runtimes were enough.
> I have a few questions though. What hardware did you run the benchmarks on? Would be cool if you could write that down somewhere for reproducibility. Also I saw that `wasmi` is included in the script but not showing in the `README.md`. Did `wasmi` not work? If it works I'd be interested in the numbers on your machine. :) Unfortunately running the benchmark script requires quite a setup.
i ran it on my macbook. (MBP 15-inch 2018)
the latest wasmi works. (at least to complete the benchmark)
it will be included when i happen to run it again.
> but honestly speaking i feel i added too many runtimes.
Yeah maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.
Btw. in order to highlight the strength of the approach you took in your `toywasm` you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.
> it will be included when i happen to run it again.
done. (actually i just pushed an unpushed result i found in local repo.)
> > but honestly speaking i feel i added too many runtimes.
>
> Yeah maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.
maybe. i myself am not interested in maintaining such a thing right now though.
> Btw. in order to highlight the strength of the approach you took in your `toywasm` you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.
thank you for your insights. i agree.
> > Btw. in order to highlight the strength of the approach you took in your `toywasm` you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.
>
> thank you for your insights. i agree.
i added a benchmark about startup time and memory consumption.
https://github.com/yamt/toywasm/blob/master/benchmark/startup.md
Thank you for those benchmarks!
They are really insightful. Seems like there is some room for improvement for wasmi.
Your toywasm looks very strong! :)
@yamt thanks to your memory consumption benchmarks I took a better look at `wasmi`'s internal bytecode and was able to make it more space efficient while keeping the original performance, or even slightly boosting it, for version `0.30.0`. This also improved translation time (startup time). Thanks again for that initial spark! :)
> @yamt thanks to your memory consumption benchmarks I took a better look at `wasmi`'s internal bytecode and was able to make it more space efficient while keeping the original performance, or even slightly boosting it, for version `0.30.0`. This also improved translation time (startup time). Thanks again for that initial spark! :)
it's good to hear! thank you for letting me know.
noted for the next run of the benchmark.
(probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)
> noted for the next run of the benchmark.

looking forward :)

> (probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)
Oh wow, that's super interesting news since `toywasm` is also an interpreter, just like `wasmi`, and I always decided against implementing SIMD in `wasmi` since I felt it would just slow down the entire interpreter for not too many actual gains. However, proper benchmarks to verify or disprove this "feeling" are always the best! :)
Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.
> > noted for the next run of the benchmark.
>
> looking forward :)
>
> > (probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)
>
> Oh wow, that's super interesting news since `toywasm` is also an interpreter, just like `wasmi`, and I always decided against implementing SIMD in `wasmi` since I felt it would just slow down the entire interpreter for not too many actual gains. However, proper benchmarks to verify or disprove this "feeling" are always the best! :) Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.
i have a similar feeling. but i added it mainly for completeness.
having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch-parse overhead.
> having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch-parse overhead.
At Parity we even found that Wasm generated from Rust source is slightly smaller when enabling Wasm SIMD, which is obviously great since translation time can be significant for some practical use cases and it usually scales linearly with Wasm blob size.
I assume you used 64-bit cells for the value stack before the introduction of SIMD to `toywasm`. If that's the case, have you simply increased the cell size to 128 bits to fit 128-bit vectors from SIMD, or do those 128-bit vectors occupy 2 cells? The latter design is probably more complex but would probably result in fewer regressions of non-SIMD instruction execution and memory consumption overall.
> > having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch-parse overhead.
>
> At Parity we even found that Wasm generated from Rust source is slightly smaller when enabling Wasm SIMD, which is obviously great since translation time can be significant for some practical use cases and it usually scales linearly with Wasm blob size.
i guess it uses llvm as a backend?
llvm seems to use simd instructions in some interesting ways.
> I assume you used 64-bit cells for the value stack before the introduction of SIMD to `toywasm`. If that's the case, have you simply increased the cell size to 128 bits to fit 128-bit vectors from SIMD, or do those 128-bit vectors occupy 2 cells? The latter design is probably more complex but would probably result in fewer regressions of non-SIMD instruction execution and memory consumption overall.
in toywasm, the value stack cell size depends on build-time configurations.
before simd, there were two configurations:
- 32-bit cell, i64 uses two cells. (default)
- 64-bit cell, any value uses one cell. (faster in many cases. i suppose `wasmi`'s `UntypedValue` works similarly to this.)
after simd, there are three:
- 32-bit cell, i64 uses two cells, v128 uses four cells. (still default)
- 64-bit cell (when simd is disabled)
- 128-bit cell (when simd is enabled)
> in toywasm, the value stack cell size depends on build-time configurations.
ah, that's very interesting and perfect for research about what cell size is the best for which use case. :) how much complexity did this add to the interpreter compared to having for example fixed 64-bit cell sizes?
> i suppose `wasmi`'s `UntypedValue` works similarly to this.

yes, that's correct.
very interesting approach and looking forward to all the results that you are going to pull off of this. :)
in the past I have been using `wasm-coremark` to test basic performance of computations for `wasmi` in comparison with Wasmtime and Wasm3, using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2.
however, this is a rather artificial benchmark and probably less ideal than your `ffmpeg` and `spidermonkey` testcases. what I found out is that the runtime performance of all 3 Wasm runtimes was extremely dependent on the underlying hardware. For example, `wasmi` performs quite okay on Intel CPUs and super poorly on M1/2, whereas both `wasmi` and Wasm3 furthermore perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/2.
Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.
> > in toywasm, the value stack cell size depends on build-time configurations.
>
> ah, that's very interesting and perfect for research about what cell size is the best for which use case. :) how much complexity did this add to the interpreter compared to having for example fixed 64-bit cell sizes?
originally toywasm was using fixed 64-bit cells.
later i added the `TOYWASM_USE_SMALL_CELLS` option to use small (32-bit) cells.
you can see the code blocks ifdef'ed on this macro to see how much complexity is involved.
besides that, i introduced `TOYWASM_USE_RESULTTYPE_CELLIDX` and `TOYWASM_USE_LOCALTYPE_CELLIDX` to speed up by-index value stack accesses like `local.get`.
(when using small cells, the `local.get` imm somehow needs to be converted to the corresponding location of the cell(s).)
you can consider them as a part of `TOYWASM_USE_SMALL_CELLS` as well.
i suppose that it can be simpler for "translating" interpreters like `wasmi`, because you can embed much of this pre-calculated information into the translated internal opcodes themselves.
> > i suppose `wasmi`'s `UntypedValue` works similarly to this.
>
> yes, that's correct.
>
> very interesting approach and looking forward to all the results that you are going to pull off of this. :)
>
> in the past I have been using `wasm-coremark` to test basic performance of computations for `wasmi` in comparison with Wasmtime and Wasm3, using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2. however, this is a rather artificial benchmark and probably less ideal than your `ffmpeg` and `spidermonkey` testcases. what I found out is that the runtime performance of all 3 Wasm runtimes was extremely dependent on the underlying hardware. For example, `wasmi` performs quite okay on Intel CPUs and super poorly on M1/2, whereas both `wasmi` and Wasm3 furthermore perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/2. Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.
interesting. i haven't thought about cpu differences much.
all my benchmarks are with:
ProductName: macOS
ProductVersion: 12.6.5
BuildVersion: 21G531
MacBook Pro (15-inch, 2018)
2.2 GHz 6-Core Intel Core i7
What just crossed my mind about cell sizes and SIMD support is the following: Maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to Wasm validation phase and type checks it should probably be possible to support SIMD without touching the already existing stack and not using this 128-bit cell stack at all (and thus not affecting non-SIMD code) when not using SIMD instructions.
Maybe I am overlooking something here. Although if this was efficient I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time needed to confirm haha.
> What just crossed my mind about cell sizes and SIMD support is the following: Maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to Wasm validation phase and type checks it should probably be possible to support SIMD without touching the already existing stack and not using this 128-bit cell stack at all (and thus not affecting non-SIMD code) when not using SIMD instructions.
>
> Maybe I am overlooking something here. Although if this was efficient I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time needed to confirm haha.
it's an interesting idea.
random thoughts:
- i guess you can even separate stacks for 32-bit and 64-bit.
- v128 values with 128-bit alignment is a nice property for at least certain cpus.
- function parameters/results might be a bit tricky to implement with the approach.
- the height of each stack needs to be tracked for possible unwinding. (eg. `br`)
- i'm not sure if it's less or more complex as a whole.
> > noted for the next run of the benchmark.
>
> looking forward :)
i reran the benchmarks:
https://github.com/yamt/toywasm/blob/master/benchmark/ffmpeg.md
https://github.com/yamt/toywasm/blob/master/benchmark/startup.md
wasmi has been improved a lot since the last time. (0.27.0)
good work!
Awesome work @yamt and thanks a ton for those benchmarks!
I am especially fond of the fact that there is nearly no difference between `toywasm (SIMD)` and `toywasm (no SIMD)`, so maybe fixed 128-bit cells are the way to go and not at all super terrible? Obviously they consume a bit more memory but even that difference isn't all too significant imo.
Looks like a very successful research conclusion to me for your SIMD implementation in `toywasm`! :)
Concerning `wasmi` performance: The optimizations I have implemented lately cannot explain this extreme difference, so I rather think that maybe the `wasmi 0.27.0` version got released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling for this to happen, and sometimes I forget about this when releasing. But still the startup time improvement is quite nice. :)
> Awesome work @yamt and thanks a ton for those benchmarks!
>
> I am especially fond of the fact that there is nearly no difference between `toywasm (SIMD)` and `toywasm (no SIMD)`, so maybe fixed 128-bit cells are the way to go and not at all super terrible? Obviously they consume a bit more memory but even that difference isn't all too significant imo. Looks like a very successful research conclusion to me for your SIMD implementation in `toywasm`! :)
i guess ffmpeg.wasm (or probably any C program) is linear-memory intensive rather than value-stack intensive.
> Concerning `wasmi` performance: The optimizations I have implemented lately cannot explain this extreme difference, so I rather think that maybe the `wasmi 0.27.0` version got released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling for this to happen, and sometimes I forget about this when releasing. But still the startup time improvement is quite nice. :)
hmm. wrt 0.27.0, it might be an error on my side.
i manually built both versions of wasmi locally as:
https://github.com/yamt/toywasm/blob/master/benchmark/notes.md#wasmi
Ah, I thought you were simply installing `wasmi` via `cargo install wasmi_cli`.
The Cargo bug that not all optimizations are properly applied mostly affects certain binaries installed via `cargo install`. For `wasmi` version `0.30.0` I made sure the optimizations are applied when installing via `cargo install`. :)
`wasmi` heavily depends on the `lto = "fat"` and `codegen-units = 1` optimization configs. Without them `wasmi` performance easily drops by 100% in some cases (or even more in others).
I just checked the `wasmi` `Cargo.toml` and it seems that if you are building `wasmi` like this then these optimizations are almost certainly not applied. I should probably change the default `--release` build here, but I was not expecting people to build `wasmi` from sources. My fault. The default is without those optimizations enabled since they significantly increase the build time of `wasmi`, so I usually only enable them for benchmarks or releases.
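For reference, profile settings like these live in `Cargo.toml`; a sketch of what such a profile section might look like (I have not verified wasmi's actual settings):

```toml
# hypothetical profile sketch; wasmi's actual Cargo.toml may differ.
# the built-in `bench` profile inherits from `release`.
[profile.bench]
lto = "fat"       # whole-program link-time optimization
codegen-units = 1 # one codegen unit allows maximal cross-unit inlining
```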
> Ah, I thought you were simply installing `wasmi` via `cargo install wasmi_cli`. The Cargo bug that not all optimizations are properly applied mostly affects certain binaries installed via `cargo install`. For `wasmi` version `0.30.0` I made sure the optimizations are applied when installing via `cargo install`. :)
things like `cargo install`, `go install`, etc. scare me a bit. :-)
> `wasmi` heavily depends on the `lto = "fat"` and `codegen-units = 1` optimization configs. Without them `wasmi` performance easily drops by 100% in some cases (or even more in others). I just checked the `wasmi` `Cargo.toml` and it seems that if you are building `wasmi` like this then these optimizations are almost certainly not applied. I should probably change the default `--release` build here, but I was not expecting people to build `wasmi` from sources. My fault. The default is without those optimizations enabled since they significantly increase the build time of `wasmi`, so I usually only enable them for benchmarks or releases.
it reminded me that, while i wanted to use `lto=full` for toywasm, cmake insisted on `lto=thin`.
Lines 83 to 87 in 9c88f24
in the meantime, i added a warning about this: 9ee47bf
If you want to benchmark `wasmi` with full optimizations and build it from sources, you can build it via `cargo build --profile bench`. So instead of `--release` you use `--profile bench`. :)
Made a pull request to your benchmark docs so it is documented: #39
Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.
> If you want to benchmark `wasmi` with full optimizations and build it from sources, you can build it via `cargo build --profile bench`. So instead of `--release` you use `--profile bench`. :)
>
> Made a pull request to your benchmark docs so it is documented: #39
thank you. i commented in the PR.
> Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.
i have no interest in blogging right now.
> If you want to benchmark `wasmi` with full optimizations and build it from sources, you can build it via `cargo build --profile bench`.
i reran with it and pushed the results.
i also updated the procedure for wasmer. (it was not clearing cache as i intended.)
Hi @yamt , thanks a lot for updating me about this!
The new `wasmi` results look much more as I would expect, being roughly twice as slow as Wasm3.
It is interesting that your `toywasm` has similar startup performance to the WAMR classic interpreter but performs way better than it at runtime.
What are your plans forward with `toywasm`?
Btw.: I am currently working on a new engine for `wasmi`, making it more similar to how Wasm3 and the WAMR fast-interpreter work internally. Looking forward to how it performs when it is done in a few weeks/months. :)
> Hi @yamt , thanks a lot for updating me about this!
>
> The new `wasmi` results look much more as I would expect, being roughly twice as slow as Wasm3.
good.
> It is interesting that your `toywasm` has similar startup performance to the WAMR classic interpreter but performs way better than it at runtime. What are your plans forward with `toywasm`?
actually, it seems that which of toywasm and iwasm classic is faster depends on the specific app being run.
i haven't investigated further. probably i should, sooner or later.
> Btw.: I am currently working on a new engine for `wasmi`, making it more similar to how Wasm3 and the WAMR fast-interpreter work internally. Looking forward to how it performs when it is done in a few weeks/months. :)
interesting.
is it this PR? wasmi-labs/wasmi#729
> actually, it seems that which of toywasm and iwasm classic is faster depends on the specific app being run.
No runtime can be efficient for all use cases - at least that's what I learned from working on `wasmi`. The only hope for a general purpose runtime is to fix all the potential weak spots so that it at least isn't terrible with any potential use case.
Due to your awesome benchmarks I see a huge potential in lazy Wasm compilation for fixing one such weak spot for startup time since a few benchmarked runtimes profit quite a bit because of their lazy compilation and/or Wasm validation.
> is it this PR? wasmi-labs/wasmi#729
Yes it is. Although still super WIP at this point. Everything is subject to change. Still trying to figure out best designs for tackling certain problems/challenges. Trade-offs here and there, I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and WAMR fast interpreter for inspiration for certain problems but in all honesty I find both rather hard to read. Not used to reading high-density C code.
> > actually, it seems that which of toywasm and iwasm classic is faster depends on the specific app being run.
>
> No runtime can be efficient for all use cases - at least that's what I learned from working on `wasmi`. The only hope for a general purpose runtime is to fix all the potential weak spots so that it at least isn't terrible with any potential use case.
sure.
being considerably slower than a similar engine (in my case iwasm classic) for a specific app is likely a sign of weak spots, or even a bug.
> Due to your awesome benchmarks I see a huge potential in lazy Wasm compilation for fixing one such weak spot for startup time since a few benchmarked runtimes profit quite a bit because of their lazy compilation and/or Wasm validation.
i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.
> > is it this PR? paritytech/wasmi#729
>
> Yes it is. Although still super WIP at this point. Everything is subject to change. Still trying to figure out best designs for tackling certain problems/challenges. Trade-offs here and there, I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and WAMR fast interpreter for inspiration for certain problems but in all honesty I find both rather hard to read. Not used to reading high-density C code.
a lot of interesting ideas in the PR. i'm looking forward to seeing how it performs.
> i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.
I can only talk for the use cases of my employer. We use Wasm in two different ways:
- Executing the hot-patchable plugin-like runtime of the entire system. This is to allow our users to customize their frameworks. This is usually a longer-running process and it is likely that a good chunk of the available functionality is going to be executed eventually. We use Wasmtime as the executor.
- Executing smart contracts. Due to the associated cost model, developers of smart contracts are rewarded for doing as little as possible upon a call to a smart contract. Therefore a single smart contract execution usually only uses a fraction of the entire Wasm blob, mostly executing a single function. Persistent data is loaded from and stored to the blockchain. Here we use `wasmi`.
very interesting. thank you for sharing use cases.
> Executing smart contracts. Due to the associated cost model, developers of smart contracts are rewarded for doing as little as possible upon a call to a smart contract. Therefore a single smart contract execution usually only uses a fraction of the entire Wasm blob, mostly executing a single function. Persistent data is loaded from and stored to the blockchain. Here we use `wasmi`.
the blob size itself doesn't cost there unless it's actually executed?
> the blob size itself doesn't cost there unless it's actually executed?
Users pay for uploading a new smart contract to the blockchain in dependence on the Wasm blob size, and they pay again for executing a smart contract, but then only for the actual computation, not for the transpilation process where `wasmi` validates and translates the Wasm bytecode to `wasmi` bytecode. This requires the transpilation process to be linear time with respect to the Wasm blob size so that malicious attackers cannot attack the system with weirdly formed Wasm blobs that take a super-linear amount of time to transpile.
thank you for the explanation.
wrt jit-bombing, i have a few crafted wasm modules in https://github.com/yamt/toywasm/tree/master/wat.
from my experience, wamr is very weak in that regard.
> wrt jit-bombing, i have a few crafted wasm modules in https://github.com/yamt/toywasm/tree/master/wat.
that's super cool!
I wish the Wasm standard test suite would include some JIT bomb tests.
> wamr is very weak in that regard
did you also test Wasmer singlepass? They claim linear time Wasm -> x86 machine code generation, and while working on Wasm -> `wasmi` bytecode translation in the register-machine approach of the new PR I found that to be hard to realize with the Wasm `multi-value` proposal in certain cases. I hope I will find a solution to the cases I have found so far.
> did you also test Wasmer singlepass? They claim linear time Wasm -> x86 machine code generation, and while working on Wasm -> `wasmi` bytecode translation in the register-machine approach of the new PR I found that to be hard to realize with the Wasm `multi-value` proposal in certain cases. I hope I will find a solution to the cases I have found so far.
no.
i have only used the homebrew-installed binary of wasmer, which seems to have only cranelift enabled.
iirc wasmer allows only a small number of function results. (1000?)
is linearity important even for such a small number?
> is linearity important even for such a small number?
It depends on how bad it is:
- Is it O(n*log(n))? With n=1000 it would be roughly 20k iterations, so probably not great, not terrible.
- Is it O(n^2) or worse? Then n=1000 means 1M iterations, which is probably pretty bad.

As a rule of thumb, in `wasmi` we try to avoid anything worse than O(n*log(n)).
> > is linearity important even for such a small number?
>
> It depends on how bad it is:
>
> - Is it O(n*log(n))? With n=1000 it would be roughly 20k iterations, so probably not great, not terrible.
> - Is it O(n^2) or worse? Then n=1000 means 1M iterations, which is probably pretty bad.
>
> As a rule of thumb, in `wasmi` we try to avoid anything worse than O(n*log(n)).
ok. it makes sense.
toywasm at this point prefers simplicity and uses O(n^2) logic in a few places. (eg. import/export handling)
> toywasm at this point prefers simplicity and uses O(n^2) logic in a few places. (eg. import/export handling)
While implementing support for the Wasm `multi-value` proposal I found certain cases where I have not found any solution that provides at least O(n*log(n)). Just noticed that Wasmer singlepass with its linear guarantees actually disables Wasm `multi-value` support. So maybe they also did not find a linear solution to certain cases: https://docs.rs/wasmer-compiler-singlepass/latest/src/wasmer_compiler_singlepass/config.rs.html#43
Instead of simply not supporting the Wasm `multi-value` proposal in `wasmi`, my idea is to implement something that I call compilation fuel: you can set it to some value linear in the size of your Wasm blob, for example, and if the Wasm translation phase runs out of this translation fuel, the translation is stopped and an error is returned. This way you can still guarantee linear time behavior to users while having non-linear algorithms for certain cases, especially if those cases are considered edge cases or impractical cases.
> > toywasm at this point prefers simplicity and uses O(n^2) logic in a few places. (eg. import/export handling)
>
> While implementing support for the Wasm `multi-value` proposal I found certain cases where I have not found any solution that provides at least O(n*log(n)). Just noticed that Wasmer singlepass with its linear guarantees actually disables Wasm `multi-value` support. So maybe they also did not find a linear solution to certain cases: https://docs.rs/wasmer-compiler-singlepass/latest/src/wasmer_compiler_singlepass/config.rs.html#43
i don't understand what's difficult with multi-value.
isn't it almost the same as function parameters?
but if you and the wasmer dev independently found it difficult, i guess it's difficult. :-)
> Instead of simply not supporting the Wasm `multi-value` proposal in `wasmi`, my idea is to implement something that I call compilation fuel: you can set it to some value linear in the size of your Wasm blob, for example, and if the Wasm translation phase runs out of this translation fuel, the translation is stopped and an error is returned. This way you can still guarantee linear time behavior to users while having non-linear algorithms for certain cases, especially if those cases are considered edge cases or impractical cases.
interesting idea.
at least it sounds nicer than having small hard limits.
> i don't understand what's difficult with multi-value.
> isn't it almost the same as function parameters?
Let me provide you with an example Wasm function that would be problematic according to my personal knowledge:
(func (param i32) (result i32 i32)
(i32.const 10)
(i32.const 20)
(br_if 0 (local.get 0))
(br_if 0 (local.get 0))
)
Each of the `br_if` are conditional return instructions. Both need to return 2 `i32` values. During translation we need to walk the live value stack to see which of the values or registers we need to return. In a stack machine model we don't have to do this since we already rely on Wasm validation to check for valid stack heights, but in a register-machine model we have to do a lightweight register allocation for things like these.
Now imagine having not just 2 `i32` return values but 10_000, and not just 2 `br_if` but 10_000; then you end up with quadratic behavior with 10k^2 iterations in total.
There are also examples that involve block parameters/results and not just function results, which are also allowed by the `multi-value` Wasm proposal. So it is even more complex than the above example demonstrates.
For the simple above example we could probably come up with an optimization, but it quickly gets more and more complex, so that we cannot really fix this attack vector by implementing more and more specialized optimization variants, as you probably can imagine.
> > i don't understand what's difficult with multi-value.
> > isn't it almost the same as function parameters?
>
> Let me provide you with an example Wasm function that would be problematic according to my personal knowledge:
>
> (func (param i32) (result i32 i32)
>   (i32.const 10)
>   (i32.const 20)
>   (br_if 0 (local.get 0))
>   (br_if 0 (local.get 0))
> )
>
> Each of the `br_if` are conditional return instructions. Both need to return 2 `i32` values. During translation we need to walk the live value stack to see which of the values or registers we need to return. In a stack machine model we don't have to do this since we already rely on Wasm validation to check for valid stack heights, but in a register-machine model we have to do a lightweight register allocation for things like these. Now imagine having not just 2 `i32` return values but 10_000, and not just 2 `br_if` but 10_000; then you end up with quadratic behavior with 10k^2 iterations in total. There are also examples that involve block parameters/results and not just function results, which are also allowed by the `multi-value` Wasm proposal. So it is even more complex than the above example demonstrates. For the simple above example we could probably come up with an optimization, but it quickly gets more and more complex, so that we cannot really fix this attack vector by implementing more and more specialized optimization variants, as you probably can imagine.
thank you for the explanation.
when you say quadratic, do you mean O(n*m) where n = number of br_if and m = number of values in the return type?
if so, isn't the wasm validation logic itself already quadratic, regardless of register allocation?
for each of 10_000 br_if, the validation logic needs to check 10_000 types on the top of the stack.
at least that's how the validation logic in toywasm would work.
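To make the shape of the problem concrete, here is a small generator sketch (hypothetical, not taken from either runtime) that emits the kind of pathological function discussed above: n i32 results followed by n conditional returns. Both a register-machine translator and a straightforward validator then do on the order of n * n work, since every br_if has to be checked against all n result types:

```python
def pathological_wat(n: int) -> str:
    """Emit a WAT function with n i32 results and n br_if instructions.

    Validating (or register-allocating) each br_if touches all n result
    types, so the total work grows as n * n.
    """
    results = " ".join(["i32"] * n)
    consts = "\n    ".join(f"(i32.const {i})" for i in range(n))
    br_ifs = "\n    ".join("(br_if 0 (local.get 0))" for _ in range(n))
    return (f"(func (param i32) (result {results})\n"
            f"    {consts}\n"
            f"    {br_ifs})")

# n = 2 gives a function analogous to the example above;
# n = 10_000 gives the adversarial case.
print(pathological_wat(2))
```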
if so, isn't the wasm validation logic itself already quadratic, regardless of register allocations?
for each of 10_000 br_if, the validation logic needs to check 10_000 types on the top of the stack.
at least it's how the validation logic in toywasm would work.
Ah yeah, you are totally right! So it is even worse than I thought. I didn't have the validation logic in mind since wasmi uses the wasmparser crate for parsing and validation, which is hosted by the Bytecode Alliance. So even the compilation fuel would not be ideal: compilation happens per bytecode after validation, so there is a chance to step over the fuel limit during validation in certain cases. Maybe adding validation fuel to wasmparser could help, but that's probably overkill.
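A "validation fuel" mechanism like the one mused about above could look roughly like this toy sketch (purely illustrative; wasmparser has no such API, and the function name here is made up):

```python
class OutOfFuel(Exception):
    """Raised when the validation work budget is exhausted."""

def check_br_ifs(result_types, n_br_ifs, fuel):
    """Simulate validating n_br_ifs conditional returns, charging one
    unit of fuel per result-type check, so a pathological module is
    rejected early instead of consuming quadratic time unchecked."""
    for _ in range(n_br_ifs):
        for _ty in result_types:
            fuel -= 1
            if fuel < 0:
                raise OutOfFuel()
    return fuel  # remaining budget
```

With 10_000 results and 10_000 br_if this aborts once the budget is spent rather than performing all 10^8 type checks.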
Hey @yamt , have you ever taken note of Ben Titzer's Wizard Wasm runtime?
- GitHub: https://github.com/titzer/wizard-engine
- Youtube talk at WasmCon: https://www.youtube.com/watch?v=TFt6ZjieSvQ&t=9933s
From what I know it uses a similar approach to toywasm,
interpreting Wasm bytecode directly without an intermediate compilation or rewrite step.
Hey @yamt , have you ever taken note of Ben Titzer's Wizard Wasm runtime?
* GitHub: https://github.com/titzer/wizard-engine * Youtube talk at WasmCon: https://www.youtube.com/watch?v=TFt6ZjieSvQ&t=9933s
From what I know it uses a similar approach to toywasm, interpreting Wasm bytecode directly without an intermediate compilation or rewrite step.
i have heard of the runtime. but i didn't know anything beyond it being written in an exotic language.
thank you.
the slides seem to suggest their interpreter performance is comparable to wasm3.
(https://youtu.be/TFt6ZjieSvQ?t=11470)
i'm interested in how it's achieved.
unfortunately their wasi support is too incomplete to run the benchmarks i usually use though.
unfortunately their wasi support is too incomplete to run the benchmarks i usually use though.
ah that is very unfortunate!
the slides seem to suggest their interpreter performance is comparable to wasm3.
I think Wasm3 is still a bit faster, but yes, performance seems to be at least comparable. From what I know it was written in raw assembler to achieve this kind of performance. So it is not super portable.
wow.
I just released v0.32.0-beta.0 - the beta version of the upcoming Wasmi v0.32.0 release.
This adds the register-machine bytecode based engine executor. Benchmarks so far concluded that it compiles roughly 30% slower and executes roughly 80-100% faster, reaching more or less Wasm3 performance in some instances. There are still some performance issues on certain machine architectures, though, which need to be fixed before the stable release.
This version also adds lazy function compilation which, combined with unchecked Wasm validation (via Module::new_unchecked), speeds up startup times by up to 27 times.
Given that toywasm experiments with faster startup times I wonder if lazy function translation could be interesting for it as well. At least for Wasmi (and Wasm3) it turned out to be extremely successful. Idea: doing nothing is still faster than doing minimal work. :)
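The effect described here can be shown with a toy model (illustrative only, not Wasmi's actual implementation): per-function translation is deferred until the first call, so instantiating a module whose functions are mostly never called does almost no up-front work.

```python
class LazyModule:
    """Toy model of lazy function compilation: each function body is
    translated only on its first call."""

    def __init__(self, bodies):
        self.bodies = bodies    # name -> untranslated "source"
        self.compiled = {}      # name -> translated function
        self.translations = 0   # how much translation work happened

    def call(self, name, *args):
        if name not in self.compiled:
            self.translations += 1
            # eval() stands in for the expensive translation step.
            self.compiled[name] = eval(self.bodies[name])
        return self.compiled[name](*args)

m = LazyModule({"add": "lambda a, b: a + b", "unused": "lambda: 0"})
print(m.call("add", 1, 2), m.translations)  # 3 1 -- "unused" was never translated
```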
I just released v0.32.0-beta.0 - the beta version of the upcoming Wasmi v0.32.0 release.
This adds the register-machine bytecode based engine executor. Benchmarks so far concluded that it compiles roughly 30% slower and executes roughly 80-100% faster, reaching more or less Wasm3 performance in some instances. Although there are some performance issues with certain machine architectures that need to be fixed before the stable release.
This version also adds lazy function compilation which combined with unchecked Wasm validation (via Module::new_unchecked) speeds up startup times by up to 27 times.
sounds great. i will use that version (or later version) when i run the benchmark next time.
Given that toywasm experiments with faster startup times I wonder if lazy function translation could be interesting for it as well. At least for Wasmi (and Wasm3) it turned out to be extremely successful. Idea: Doing nothing is still faster than doing minimal work. :)
i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.
but i might change my mind in future.
i have no plan to do it in toywasm for now because, in the current implementation, loaded modules are completely read-only and i like it.
I can understand that. Indeed lazy Wasm validation and translation added a lot of complexity to the codebase. Keeping things simple is also a very valuable strength in software.
For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2, which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)
For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.1, which can be installed with cargo install wasmi_cli --version 0.32.0-beta.1. :)
ok!
For the next benchmark runs: I added support for lazy Wasm validation via wasmi_cli --compilation-mode lazy for version v0.32.0-beta.2, which can be installed with cargo install wasmi_cli --version 0.32.0-beta.2. :)
i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934
i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934
i tried v0.32.0-beta.7. it worked. #143
i tried 0.32.0-beta.5. it didn't work well. wasmi-labs/wasmi#934
i tried v0.32.0-beta.7. it worked. #143
Great! I am very happy that it works now. 🥳
Thanks a lot for sharing your benchmarks!
The runtime numbers for Wasmi (~31s) look very off, though, especially when compared to old Wasmi with ~34s. In all practical (non-artificial) benchmarks we have conducted so far between the two engines, we usually found at least a 50-150% performance improvement from old Wasmi to new Wasmi.
However, we also saw one incident with our benchmark runners that experienced extreme inefficiency and thus nearly the same performance with both engines, maybe that's what is happening on your machine, too.
The startup performance numbers look as expected, though.
Unfortunately I cannot tell what is or was causing the extreme inefficiency for our old benchmark runners (that looks similar to yours) and if it is even fixable. We know it was an inefficiency problem (or some sort of bug) because we saw that the new Wasmi executed via Wasmtime was actually faster than when run natively which does not make sense at all.
edit: Here is a screenshot of someone benchmarking many different execution engines, amongst others also Wasmi (stack), Wasmi (register) and Wasm3, and as you can see, at least in this independent set of benchmarks Wasmi (register) even outperformed Wasm3 in one of the 2 tests by ~10%. Note though that these benchmarks are already quite dated and a lot of improvements have made it into Wasmi (register) since then. So without miscompilation (or whatever is causing the inefficiency) I would expect Wasmi (register) to at least be on par with Wasm3.
https://forum.polkadot.network/t/announcing-polkavm-a-new-risc-v-based-vm-for-smart-contracts-and-possibly-more/3811/52?u=robinf
I really really hope I can find out what is causing those inefficiencies (or miscompilations) on some of those hardware systems. :S
Unfortunately I cannot tell what is or was causing the extreme inefficiency for our old benchmark runners (that looks similar to yours) and if it is even fixable. We know it was an inefficiency problem (or some sort of bug) because we saw that the new Wasmi executed via Wasmtime was actually faster than when run natively which does not make sense at all.
heh.
maybe you crafted code which cranelift compiles better than llvm. :-)
i've seen similar with toywasm on wamr aot.
edit: Here is a screenshot of someone benchmarking many different execution engines, amongst others also Wasmi (stack), Wasmi (register) and Wasm3, and as you can see, at least in this independent set of benchmarks Wasmi (register) even outperformed Wasm3 in one of the 2 tests by ~10%. Note though that these benchmarks are already quite dated and a lot of improvements have made it into Wasmi (register) since then. So without miscompilation (or whatever is causing the inefficiency) I would expect Wasmi (register) to at least be on par with Wasm3. https://forum.polkadot.network/t/announcing-polkavm-a-new-risc-v-based-vm-for-smart-contracts-and-possibly-more/3811/52?u=robinf
nice.
I really really hope I can find out what is causing those inefficiencies (or miscompilations) on some of those hardware systems. :S
as you know, performance is sometimes subtle and interesting. i want to know the cause too.
the output of "toywasm --version" includes a version string like "toywasm v43.0.0".
in the past, when i changed it to be a bit longer, eg. "toywasm v43.0.0-foo", it changed coremark number by ~10%.
my wild guess was it was something related to the layout in the executable. but i couldn't find the cause.
maybe you crafted code which cranelift compiles better than llvm. :-)
i've seen similar with toywasm on wamr aot.
Uhhh, that is actually an interesting thought! When testing Wasm performance for Wasmi the pipeline went through LLVM -> wasm-opt -> Cranelift (via Wasmtime), so this thought didn't occur to me, but it for sure is interesting.
My main problem is that I do not have the hardware for which Wasmi compiles badly to tinker with it. On my own machine Wasmi compiles okay and on hitchhooker's machine it compiles even better.
My suspicion is the execution loop, which is based on a vanilla loop+match construct that is very fragile with respect to optimizations. I wish I could use a tail-call based dispatcher because it allows for more control over the code, but tail calls sadly do not exist in Rust as of today.
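For reference, a minimal sketch of the "vanilla loop + match" dispatch style being discussed (illustrative only; Wasmi's real loop is of course far larger and in Rust): one outer loop whose body branches on the current opcode, giving the compiler a single large branchy region whose optimization the author has little control over.

```python
def run(code):
    """Interpret a tiny stack-machine program with loop + match dispatch."""
    stack = []
    pc = 0
    while pc < len(code):      # the dispatch loop
        op, arg = code[pc]
        if op == "push":       # each opcode is one branch arm
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "halt":
            break
        pc += 1
    return stack

print(run([("push", 1), ("push", 2), ("add", None), ("halt", None)]))  # [3]
```

A tail-call based dispatcher would instead jump from each opcode handler directly to the next handler, keeping each handler a small, separately optimized function.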
as you know, performance is sometimes subtle and interesting. i want to know the cause too.
I will keep you updated if I ever find out.
the output of "toywasm --version" includes a version string like "toywasm v43.0.0".
in the past, when i changed it to be a bit longer, eg. "toywasm v43.0.0-foo", it changed coremark number by ~10%.
my wild guess was it was something related to the layout in the executable. but i couldn't find the cause.
Oh wow, that sounds super awkward, too! Optimization in those scenarios can feel a bit cumbersome because we are missing some knobs for control in certain areas (e.g. executable layout).