wasmi-labs/wasmi

Wasmi versions `0.32.0-beta.6` cannot run `ffmpeg.wasm` correctly

yamt opened this issue ยท 20 comments

yamt commented
  1. put wasmi_cli in your path https://github.com/yamt/toywasm/blob/master/benchmark/notes.md#wasmi
  2. checkout toywasm https://github.com/yamt/toywasm and cd into the directory
  3. run the following
(venv) spacetanuki% ./test/run-ffmpeg.sh wasmi_cli --dir .video --
executing File(".ffmpeg/ffmpeg.wasm")::_start() ...
.video/video-1080p-60fps-2s.mp4: Invalid data found when processing input
(venv) spacetanuki% 

0.32.0-beta.5 doesn't work.
0.32.0-beta.6 doesn't work.
today's master (352f8ae) doesn't work.
v0.31.0 works.

@yamt thanks a lot for the bug report. I am going to inspect and fix it asap.

Ideally can you provide me with the ffmeq.wasm file that you have used so that I can properly reproduce?

yamt commented

Ideally can you provide me with the ffmeq.wasm file that you have used so that I can properly reproduce?

you can download it from https://assets.runno.dev/ffmpeg/ffmpeg.wasm

the script automatically download it: https://github.com/yamt/toywasm/blob/b73e2ae6403ce18e477a884ab8634d1d7c840dda/test/run-ffmpeg.sh#L48

yamt commented

forgot to note: it's macOS/x86-64 if it matters.

spacetanuki% sw_vers
ProductName:    macOS
ProductVersion: 12.7.2
BuildVersion:   21G1974
spacetanuki% 

Thanks a lot for all the information, @yamt !

I traces loads and stores for both Wasmi (stack) and Wasmi (register) while executing the ffmpeg script and found some differences. Note that for performance reasons I only logged every 100th executed load or store.

On line 812 both files start to differ. Wasmi (stack) interleaves some store instructions whereas Wasmi (register) does not. This helps me to track down where exactly things are going side ways and is the first step towards fixing the issue. Note the swapped [$address + $offset] in the output for some of the logged instructions is not critical from what I saw and just an optimization artifact from Wasmi (register).

edit: I refined bounds and found a more precise point where the executions diverge. The difference starts on line 76 where Wasmi (stack) stores 2450 store mem[12052292 + 0] <- 0x1569F8 and Wasmi (register) stores 2450 store mem[12052292 + 0] <- 0x0.

edit 2: I found the location in the Wasm file that causes the corruption. The i32.store on line 2381225 is divergent in Wasmi (stack) and Wasmi (register). (Can be found via VS Code regex: local.set [\d]*[\n\s]*local.get [\d]*[\n\s]*local.get [\d]*[\n\s]*i32.load offset=76[\s\n]*local.tee [\d]*[\n\s]*i32.ge_s) Particularly the value in local variable with index 7 is diverging on both Wasmi (stack = 0x1569F8) and Wasmi (register = 0x0). This indicates that the call to (func 5401) somehow returns the wrong value.

image

I think I have found the underlying issue. The issue unfortunately is in the Wasm -> Wasmi bytecode translation phase.

Given the following .wat file:

(module
    (func (param i32 i32 i32) (result i32)
        local.get 0
        block (result i32)  ;; label = @1
        local.get 1
        local.get 2
        br_if 0 (;@1;)
        drop
        i32.const 0
        local.set 0
        i32.const 1
        end
        local.get 1
        local.get 2
        i32.lt_s
        local.tee 0
        select
    )
)

Or the equivalent .wat file with folded expressions for better readability:

(module
    (func (param i32 i32 i32) (result i32)
        (select
            (local.get 0)       ;; lhs - uses (local 0) that is conditionally overwritten
            (block (result i32) ;; rhs - conditionally overwriting (local 0)
                (drop (br_if 0
                    (local.get 1) ;; br_if return value
                    (local.get 2) ;; br_if condition
                ))
                (local.set 0 (i32.const 0)) ;; overwrites (local 0) after `br_if` conditionally
                (i32.const 1)
            )
            ;; condition - overwriting (local 0)
            (local.tee 0
                (i32.lt_s (local.get 1) (local.get 2))
            )
        )
    )
)

We see that the select instruction takes 3 different inputs:

  • a condition
  • a left-hand side (condition = true)
  • a right-hand side (condition = false)

While the left-hand side input is trivially just (local.get 0), both right-hand side as well as the condition overwrite (local 0). On top of that the right-hand side input only conditionally overwrites (local 0) thus. creating divergent control flow within the block which has a result value that acts as the right-hand side argument to select.

The problem with our current Wasm -> Wasmi bytecode translation is that it does not properly respect the divergent control flow in this situation and produces the following sequence of Wasmi bytecode instructions:

[
    Instruction::branch_i32_eq_imm(Register::from_i16(2), 0, BranchOffset16::from(3)),
    Instruction::copy(3, 1),
    Instruction::branch(BranchOffset::from(4)),
    Instruction::copy(5, 0),
    Instruction::copy_imm32(Register::from_i16(0), 0),
    Instruction::copy_imm32(Register::from_i16(3), 1),
    Instruction::i32_lt_s(Register::from_i16(0), Register::from_i16(1), Register::from_i16(2)),
    Instruction::select(Register::from_i16(3), Register::from_i16(0), Register::from_i16(5)),
    Instruction::Register(Register::from_i16(3)), // select right-hand side input
    Instruction::return_reg(3),
]

With the following register table:

0 1 2 3 4 5
Parameter 0 Parameter 1 Parameter 2 Dynamic 0 Unused Preservation 0

The select instruction clearly takes Register(5) as left-hand side parameter. However, if branch_i32_eq_imm evaluates to false and does not take the branch Register(5) is never written to since it is only written to by Instruction::copy(5, 0) at index 3 which is jumped over in that case. Thus reading 0x0 instead of a proper value which is exactly what we can observe with the faulty Wasm execution above.

There are 2 potential fixes that come to my mind right now:

  1. Patch the divergent control flow to also include Instruction::copy(5, 0).
  2. Introduce a conditional which selects the correct register for the select instruction in dependence of the chosen divergent path of execution.
  3. Unfold into many select instructions, one for each divergent control flow with each its different inputs.
  4. Move the copy(5, 0) before the block expression since the block expression as a whole overwrites (local 0).

The problems with each of the proposed solutions are these:

    1. This requires book keeping of all divergent paths and which local variables they touch or update. This could interfere with branch instruction offsets. An efficient codegen implementation and avoid branch offset invalidation would require lookahead knowledge of the Wasm bytecode of an entire block or any more complex expression.
    1. This introduces even more state to keep track of and it is not yet clear to me where state introductions and updates need to happen precisely.
    1. Similar as 1) and also likely code bloat.
    1. Similar as 1).

@yamt As one of the WAMR authors I suppose you had to deal with the same issue with the fast interpreter.

Do you think other instructions could be affected, too?

@athei This bug is not tied to any single instruction such as select but connected to improper handling of divergent control flow where one path overwrites a local x and the other partially overwrites x. Also x needs to be the input to an instruction parameter prior to the expression containing this divergent control flow.

So I assume this could have happened with other instructions, too. It just happened to expose itself with the select instruction.

It is interesting that this bug was not discovered earlier with all the Polkadot runtime execution and our fuzzing. So I assume code like this is not encoded very often.

edit: To give an example to your question: Similar control flow could have been achieved with if and br_table, too and maybe even more that I cannot yet think of. Also it could have worked with any binary instruction such as i32.add with carefully chosen inputs for lhs and rhs and foremost with call and its n-ary set of parameters and maybe even with normal control flow of the multi-value Wasm proposal which introduces copy instructions at specific code points copying multiple values. So we need a generalized and proper solution to this problem.

@yamt As one of the WAMR authors I suppose you had to deal with the same issue with the fast interpreter.

i'm not so familiar with the fast interpreter.

a bit simpler function which wasmi seems to misinterpret:

(module
    (func (export "test") (param i32 i32) (result i32)
        local.get 0
        block
            local.get 1
            br_if 0
            i32.const 9
            local.set 0
        end
    )
)
spacetanuki% wasmi_cli --invoke test a.wasm 1 1
executing File("a.wasm")::test(1, 1) ...
0
spacetanuki% toywasm --load a.wasm --invoke "test 1 1"
Result: 1:i32
spacetanuki% iwasm.fast -f test a.wasm 1 1       
0x1:i32
spacetanuki% 

wamr fast interpreter would compile it to:

EXT_OP_COPY_STACK_TOP   0   3   
WASM_OP_BR_IF   1   0   
EXT_OP_SET_LOCAL_FAST   0   -1  
EXT_OP_COPY_STACK_TOP   3   2       
WASM_OP_RETURN  2

where
-1: constant (9)
0..1: locals
2: dynamic
3: preserved

@yamt Indeed that is simpler because it takes the other route of the conditional local set in order to evaluate incorrectly. With my Wasm test cases I wanted to stay as close as possible to the original inputs. How did you make WAMR fast interpreter print the instructions? That's pretty handy. I also wanted to implement something like this in Wasmi with its own bytecode for inspection and debugging purposes.

How did you make WAMR fast interpreter print the instructions? That's pretty handy.

it only has a primitive printf-style logging. (WASM_DEBUG_PREPROCESSOR)
i performed a cumbersome manual post-processing. that's one of reasons why i prefer smaller test cases. :-)

I also wanted to implement something like this in Wasmi with its own bytecode for inspection and debugging purposes.

yea, it would be great to have. almost must to have i suspect.

@yamt In that case thanks a lot for your manual post-processing and providing me with this information to help me fix the bug! :)

In an attempt to explain (also to myself) what is going on currently, what should be done practically and what should be done ideally let us look at the example given by @yamt :

(module
    (func (param i32 i32) (result i32)
        local.get 0
        block
            local.get 1
            br_if 0
            i32.const 10
            local.set 0
        end
    )
)

Translated to Wasmi bytecode WAMR fast-interpreter produces roughly the following bytecode. (Please correct me if I am wrong.)

Instruction::copy(3, 0),
Instruction::branch_i32_eq_imm(Register::from_i16(1), 0, BranchOffset16::from(2)),
Instruction::copy_imm32(Register::from_i16(0), 10_i32),
Instruction::copy(2, 3),
Instruction::return_reg(2),

Note that at this point I do not fully understand the difference or need for difference between WAMR's EXT_OP_SET_LOCAL_FAST and EXT_OP_COPY_STACK_TOP instructions, since to me they both look like copy instructions. Maybe the only reason is to tell the origins apart, but this could technically be done by looking at their data values.

So what WAMR seems to be doing is to analyze which locals are set (via local.set or local.tee) within a block, if or loop upon entering translation of the control structure and preserve all referenced locals before actually entering the control structure. This can be seen with the first instruction that copies the value at register 0 to register 3 which acts as the preservation register. Also it seems that despite the fact that the block of the Wasm function does not return any values, there is the copy from register 3 to register 2 which (to my understanding) acts as the result register. However, a more ideal codegen wouldn't need this copy instructions and simply use register 3 directly instead.

The code that Wasmi currently produces is:

Instruction::branch_i32_ne_imm(Register::from_i16(1), 0, BranchOffset16::from(3)),
Instruction::copy(2, 0),
Instruction::copy_imm32(Register::from_i16(0), 10_i32),
Instruction::return_reg(2),

The code that I would like Wasmi to produce is:

Instruction::copy(2, 0),
Instruction::branch_i32_ne_imm(Register::from_i16(1), 0, BranchOffset16::from(2)),
Instruction::copy_imm32(Register::from_i16(0), 10_i32),
Instruction::return_reg(2),

So basically in this example just swap the first 2 instructions around.
However, this is very tricky since this requires some kind of lookahead or preprocessing pass over which control flow structures manipulate which local variables in order to insert those copies in a single pass fashion. Otherwise we'd need 2 passes in order to generate code. Patching code by inserting the instructions after initial codegen is very tricky since this easily invalidates branching offsets and thus implementation becomes challenging.

I am not yet sure which route I am wiling to go down and which is the most promising and most efficient resulting in the best possible codegen while not being super hard to code and maintain.


For the more complicated nested version of the above Wasm code example:

(module
    (func (param i32 i32) (param $c0 i32) (param $c1 i32) (result i32 i32)
        local.get 0 ;; 1st return value
        local.get 1 ;; 2nd return value
        block
            local.get $c0
            br_if 0
            i32.const 10
            local.set 0  ;; conditionally overwrites (local 0) on stack
            block
                local.get $c1
                br_if 1
                i32.const 20
                local.set 1  ;; conditionally overwrites (local 1) on stack
            end
        end
    )
)

Wasmi currently produces the following Wasmi bytecode:

Instruction::branch_i32_ne_imm(Register::from_i16(2), 0, BranchOffset16::from(6)),
Instruction::copy(5, 0),
Instruction::copy_imm32(Register::from_i16(0), 10_i32),
Instruction::branch_i32_ne_imm(Register::from_i16(3), 0, BranchOffset16::from(3)),
Instruction::copy(4, 1),
Instruction::copy_imm32(Register::from_i16(1), 20_i32),
Instruction::return_reg2(5, 4),

Which is incorrect and we actually want to produce the following Wasmi bytecode ideally:

Instruction::copy(5, 0),
Instruction::copy(4, 1),
Instruction::branch_i32_ne_imm(Register::from_i16(2), 0, BranchOffset16::from(4)),
Instruction::copy_imm32(Register::from_i16(0), 10_i32),
Instruction::branch_i32_ne_imm(Register::from_i16(3), 0, BranchOffset16::from(2)),
Instruction::copy_imm32(Register::from_i16(1), 20_i32),
Instruction::return_reg2(5, 4),

Thus when scanning for a block, if and loop upon entry which locals need to be preserved we actually need to do this scan recursively for all nested block, if and loop blocks.

I implemented some naive way for local preservation before entering control flow structures which showed no significant negative impact on performance which is a strong relieve for me since I expected that.

Some WIP downsides of the PR are listed in its description here: #945

I already tested ffmpeg.wasm and the PR seems to resolve the reported issue, however, the execution of ffmpeg.wasm is now failing for an out-of-bounds memory access. This could mean one of two things: Either the PR fix is buggy and does not respect some use cases properly thus resulting in invalid behavior for ffmpeg.wasm OR there is yet another bug that prevents ffmpeg.wasm to be executed validly.

While #945 successfully fixes the bug reported in this issue, running ffmpeg.wasm with the fix crashes with yet another error:

Error: failed during execution of _start: out of bounds memory access

Tracking it down again via load and store instructions we see that the failure now happens quite a lot later than previously at load instruction execution 540480 instead of at 78500. Further investigation needs to happen. But since #945 not only demonstrates to fix the reported bug it also contains other improvements and many more (regression) tests.

Therefore I am going to merge #945 and start working on an investigation and follow-up on the remaining bug(s) until ffmpeg.wasm executes successfully under the new Wasmi (register).

Execution log of selected instructions while running ffmpeg.wasm on the fixed Wasmi can be found here:
https://gist.github.com/Robbepop/d17d6c966d69ac619f2f1e30d5728d3f

The system automatically closed this issue but I am going to re-open it immediately since, while the originally reported bug has been fixed, the ffmpeg.wasm binary still does not execute properly under the new Wasmi version.

edit: Upon further investigation it seems that the ffmpeg.wasm execution now fails within (func (;8877;) (type 5) (param i32 i32) (result i32).

@yamt #948 and #945 together fixed this issue.
I can now run ffmpeg.wasm locally and it runs successfully.

I timed the ffmpeg.wasm executions locally ...

  • Wasmi (stack): ~40s
  • Wasmi (register): ~18s

Thus ~120% speedup.

I am going to release another beta release later today so it is easier to give it a try for yourself. :)

edit: Just released Wasmi v0.32.0-beta.7 including the fixes.

@yamt #948 and #945 together fixed this issue. I can now run ffmpeg.wasm locally and it runs successfully.

I timed the ffmpeg.wasm executions locally ...

* Wasmi (stack): ~40s

* Wasmi (register): ~18s

Thus ~120% speedup.

I am going to release another beta release later today so it is easier to give it a try for yourself. :)

edit: Just released Wasmi v0.32.0-beta.7 including the fixes.

thank you!

@yamt #948 and #945 together fixed this issue. I can now run ffmpeg.wasm locally and it runs successfully.
I timed the ffmpeg.wasm executions locally ...

* Wasmi (stack): ~40s

* Wasmi (register): ~18s

Thus ~120% speedup.
I am going to release another beta release later today so it is easier to give it a try for yourself. :)
edit: Just released Wasmi v0.32.0-beta.7 including the fixes.

thank you!

Thank you again for reporting the issue in the first place. Now that these bugs have been fixed and ffmpeg.wasm demonstratively runs with the new Wasmi version I am much more confident in the code than I was before. :)