nod-ai/iree-amd-aie

XRT sync related numerical error

newling opened this issue · 4 comments

This describes a numerical issue, and the current fix is in #218.

A brief history, and summary of findings:

  • A matmul of size m=n=64, k=16 with int32 fails about 50% of the time for me locally. The failure occurs when running on the iree-amd-aie stack (i.e. using the HAL runtime). You should be able to reproduce this by checking out the following PR (assuming you have peano, a Phoenix AIE, etc.): #208, which just runs the test in a loop.

On CI there are also failures, but they are much rarer (roughly 1 failure in 1000 runs).

  • To narrow down the issue, I wanted to try to reproduce the error on a different stack. I built mlir-air and managed to run https://github.com/Xilinx/mlir-air/tree/main/test/xrt/08_gemm_extern_vec successfully (thanks @erwei-xilinx for helping me).

  • I modified the above mlir-air test to use the xclbin and lx6 instructions generated in #208. Running it 2,000 times, I saw no numerical errors. This made me think this is NOT a compiler error (the xclbin seems to be totally fine when run through mlir-air), but rather something specific to our setup in iree-amd-aie. I started digging into the runtime, adding print statements to see when we sync: see the branch https://github.com/newling/iree-amd-aie/tree/intermittent_numerics_debugging for the print additions.

  • I noticed that the mlir-air matmul test syncs the 'C' tensor (C = matmul(A, B)) to the device before running the kernel, but that iree-amd-aie does not have such a sync (iree-amd-aie only syncs C after running the kernel). @nirvedhmeshram and I figured out that adding the sync here (see this current PR) makes C sync before the matmul and resolves the intermittent numerical error. A sketch of this ordering is shown below.
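
For illustration, here is a minimal sketch of that ordering against the XRT C++ API, modelled on the mlir-air XRT tests. It is not code from either repository: the kernel name, argument indices, and buffer flags are assumptions.

#include <cstdint>
#include <string>
#include <vector>

#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

void run_matmul_once(const std::string &xclbin_path,
                     const std::vector<uint32_t> &instr) {
  xrt::device device(0);
  auto xclbin_uuid = device.load_xclbin(xclbin_path);
  // Kernel name is an assumption (mlir-air tests typically use "MLIR_AIE").
  xrt::kernel kernel(device, xclbin_uuid, "MLIR_AIE");

  // lx6 instruction buffer plus A, B, C; sizes match the 64x16 * 16x64 matmul.
  xrt::bo bo_instr(device, instr.size() * sizeof(uint32_t),
                   XCL_BO_FLAGS_CACHEABLE, kernel.group_id(0));
  xrt::bo bo_a(device, 64 * 16 * sizeof(int32_t), XRT_BO_FLAGS_HOST_ONLY,
               kernel.group_id(2));
  xrt::bo bo_b(device, 16 * 64 * sizeof(int32_t), XRT_BO_FLAGS_HOST_ONLY,
               kernel.group_id(3));
  xrt::bo bo_c(device, 64 * 64 * sizeof(int32_t), XRT_BO_FLAGS_HOST_ONLY,
               kernel.group_id(4));

  bo_instr.write(instr.data());
  bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  // The sync that mlir-air performs and that iree-amd-aie was missing:
  // flush C to DDR *before* the kernel runs, so no stale cache line for C
  // can be written back over the AIE's result afterwards.
  bo_c.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  auto run = kernel(bo_instr, static_cast<uint32_t>(instr.size()), bo_a, bo_b, bo_c);
  run.wait();

  // Only after the kernel completes do we pull C back to the host.
  bo_c.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
}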

Consider this test case:

func.func @matmul_64x64_32xi32_(%lhs : tensor<64x16xi32>, %rhs : tensor<16x64xi32>) -> tensor<64x64xi32> {
  %init_acc = tensor.empty() : tensor<64x64xi32>
  %c0_acc_type = arith.constant 7 : i32
  %acc = linalg.fill ins(%c0_acc_type : i32) outs(%init_acc : tensor<64x64xi32>) -> tensor<64x64xi32>
  %result = linalg.matmul ins(%lhs, %rhs: tensor<64x16xi32>, tensor<16x64xi32>) outs(%acc: tensor<64x64xi32>) -> tensor<64x64xi32>
  return %result: tensor<64x64xi32>
}

i.e. a matmul where the initial buffer is filled with '7' (C = 7 + A @ B). Compiling this through iree-amd-aie and then running iree-run-module (before the fix in this PR) as follows:

iree-run-module --module=success.vmfb --device=xrt --input="64x16xi32=1" --input="16x64xi32=2" --expected_output="64x64xi32=39"

produces about 50% successful runs (for me locally). The syncs printed (see branch https://github.com/newling/iree-amd-aie/tree/intermittent_numerics_debugging) are as follows in the case of successful runs:

Syncing instr buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)


Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515692544


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515692544


Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515700736


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515700736

EXEC @matmul_64x64_32xi32_
Binding count: 3

Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94100515893248


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94100515893248

[SUCCESS] all function outputs matched their expected values.

And about 50% of the time the run fails. The logs when it fails look like:

Syncing instr buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)


Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848805888


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848805888


Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848814080


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848814080

EXEC @matmul_64x64_32xi32_
Binding count: 3

Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592


Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592


Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592


Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592

[FAILED] result[0]: element at index 4080 (0) does not match the expected (39); expected that the view is equal to contents of a view of 64x64xi32
  expected:
64x64xi32=[39 39 39 39 39 39 39 39 39 39 ...

Comparing the syncs printed in the successful and failing cases, one curious difference stands out: when it fails, C (the buffer of size 16384) is synced from and back to the device three times after the run, whereas when it passes it is synced only once.

If we run the command

iree-run-module --module=success.vmfb --device=xrt --input="64x16xi32=1" --input="16x64xi32=2" --output_max_element_count=99999

so that we can see the values when there is an error, we see:

39 39 39 39 39 39 39 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

i.e. there are 0's instead of 39's.

This makes us think that IREE is writing 0's to C after the AIE has written the result (39) back from the device. There is a PR illustrating this theory over here: #224. In that PR, we manually test what happens in the cases where we

  1. Either do or don't write to the XRT DDR buffer for C, and then
  2. Either do or don't sync (calling sync(XCL_BO_SYNC_BO_TO_DEVICE)), thereby ensuring that the CPU cache and DDR agree.

In the case where we (1) do write to the XRT buffer and then (2) do not sync, we reproduce the iree-amd-aie bug which this current PR aims to fix. A sketch of that experiment is shown below.
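
For concreteness, here is a rough sketch of that experiment against the XRT C++ API. This is not the code in #224; the buffer handle and how it was bound to the kernel are assumptions.

#include <cstdint>
#include <cstring>

#include "xrt/xrt_bo.h"

// The combinations tested in #224, sketched for the C buffer only.
// 'bo_c' is assumed to already be bound to the kernel's C argument.
void prepare_c(xrt::bo &bo_c, bool do_write, bool do_sync) {
  if (do_write) {
    // Write zeros into the host-side mapping of the XRT buffer.
    auto *host_ptr = bo_c.map<int32_t *>();
    std::memset(host_ptr, 0, 64 * 64 * sizeof(int32_t));
  }
  if (do_sync) {
    // Flush the CPU cache so DDR agrees with what was just written.
    bo_c.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  }
  // do_write == true, do_sync == false is the combination that reproduces
  // the intermittent failure: the zeros sit in dirty cache lines and can be
  // written back to DDR after the AIE has produced its result.
}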

Working on the assumption that IREE is doing a memset to 0 of 64*64*sizeof(int) bytes to the XRT DDR buffer, where is this happening? At this point I have unfortunately hit a brick wall. I have inserted hundreds of print statements in the IREE runtime but cannot narrow down where such a memset is happening.

Note that such a memset is not necessary for correctness: the matmul is not an accumulation, and so the initial value of the C buffer can safely be garbage. Ideally IREE would realize that and not set C to zero.

Supposing that IREE cannot avoid initializing C with zeros (that is, IREE must do it): is there somewhere in IREE where we could call iree_hal_buffer_invalidate_range and/or iree_hal_buffer_flush_range? These are the functions from which we call the syncs. Here too I have unfortunately hit a wall trying to decipher the runtime.

This is not urgent, but what I want an expert opinion on (@benvanik, @antiagainst) at some point is:

  1. Where is C being 0 initialized in the HAL runtime, and can this be avoided?

  2. Where is iree_hal_xrt_direct_command_buffer_dispatch:

    /*.dispatch = */ iree_hal_xrt_direct_command_buffer_dispatch,
    being called from in the IREE runtime, and would it make sense for that caller to also call iree_hal_buffer_invalidate_range and/or iree_hal_buffer_flush_range? A hypothetical sketch of what that could look like is given below.
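
As a purely hypothetical illustration of question 2, a flush at dispatch time could look roughly like the following; the binding bookkeeping and the way the underlying xrt::bo is obtained are assumptions for the sketch, not the actual iree-amd-aie code.

#include "xrt/xrt_bo.h"

// Hypothetical: flush every buffer bound to the dispatch so that DDR and the
// CPU cache agree before the kernel runs (the moral equivalent of calling
// iree_hal_buffer_flush_range on each binding). 'bindings' and 'binding_count'
// stand in for whatever bookkeeping the XRT command buffer keeps.
static void flush_dispatch_bindings(xrt::bo *bindings, int binding_count) {
  for (int i = 0; i < binding_count; ++i) {
    bindings[i].sync(XCL_BO_SYNC_BO_TO_DEVICE);
  }
}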

See also the comment: #218 (comment)

@benvanik and @antiagainst this is the numerical issue I mentioned a few weeks ago (during the GPU numerical and performance push week). I was hoping to get your input on it.

I'm not sure I fully follow, but it sounds like there are some basic questions that need answering before you can triage further.

There's only one way we memset: iree_hal_command_buffer_fill_buffer, and we only do that if the program generated by the compiler tells us to with a hal.command_buffer.fill_buffer op. You can pass --compile-to=hal to iree-compile and then look at the IR, or set a breakpoint in the C function at runtime, or use tracy to see what's being called. There's no magic: if you don't spot a fill buffer command then there is no fill happening outside of whatever may be within the dispatch.

Once you've narrowed that down, if it is inside the dispatch (which I strongly believe it will be in this case) then that's a codegen issue entirely within the generated executable and not something the IREE runtime has involvement in.
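
For example, a check along those lines might look like the following (--compile-to=hal is the flag mentioned above; the amd-aie value for the target-backends flag is my assumption), to show whether the compiler emitted an explicit fill:

iree-compile --iree-hal-target-backends=amd-aie --compile-to=hal matmul.mlir -o matmul.hal.mlir
grep hal.command_buffer.fill_buffer matmul.hal.mlir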

FWIW, fill_buffer is not yet implemented by the XRT HAL:

static iree_status_t iree_hal_xrt_direct_command_buffer_fill_buffer(
    iree_hal_command_buffer_t* base_command_buffer,
    iree_hal_buffer_t* target_buffer, iree_device_size_t target_offset,
    iree_device_size_t length, const void* pattern,
    iree_host_size_t pattern_length) {
  return iree_make_status(IREE_STATUS_UNIMPLEMENTED,
                          "fill buffer not yet supported");
}

so it makes me wonder if the reason for the bug is something different altogether (like a driver issue) and not the "sync needed after a memset" theory. James and I are doing some more investigation on it.

Useful information, thanks @benvanik. So it does seem like this is a driver/XRT issue then. @nirvedhmeshram please let me know if you found anything else while I was away; otherwise I guess we can close this issue (it seems like the driver on Dave's machine was updated).