XRT sync related numerical error
newling opened this issue · 4 comments
This describes a numerical issue, and the current fix in: #218
A brief history, and summary of findings:
- matmul of size m=n=64 k=16 with int32 fails about 50% of the time for me, locally. This failure is when running on the iree-amd-aie stack (i.e. using the HAL runtime) You should be able to reproduce this by cloning the following PR (assuming you have peano, phoenix AIE, etc) : #208 which just runs the test in a loop.
On CI there are also failures, but much rarer (1 in 1000 failures).
-
To narrow down the issue, I wanted to reproduce (or not) the error on a different stack. I built mlir-air and managed to run https://github.com/Xilinx/mlir-air/tree/main/test/xrt/08_gemm_extern_vec successfully (thanks @erwei-xilinx for helping me)
-
I modified the above mlir-air test to use the xclbin and lx6 instructions generated in #208 . Running 2,000 times, I saw no numerical errors. This made me think this is NOT a compiler error (the xclbin seems to be totally fine when run through mlir-air). But rather something specific to our setup in iree-amd-aie. I started digging into the runtime, adding print statements to see when we sync: see branch https://github.com/newling/iree-amd-aie/tree/intermittent_numerics_debugging for the print additions.
-
I noticed that the mlir-air matmul test has a sync for the 'C' (C = matmul(A, B)) tensor before running the kernel, but that iree-amd-aie does not have such a sync (iree-amd-aie only syncs C after running the kernel). @nirvedhmeshram and I figured out that adding the sync here (see this current PR) makes C sync before the matmul, and resolves the numerical intermittent error.
Consider this test case:
func.func @matmul_64x64_32xi32_(%lhs : tensor<64x16xi32>, %rhs : tensor<16x64xi32>) -> tensor<64x64xi32> {
%init_acc = tensor.empty() : tensor<64x64xi32>
%c0_acc_type = arith.constant 7 : i32
%acc = linalg.fill ins(%c0_acc_type : i32) outs(%init_acc : tensor<64x64xi32>) -> tensor<64x64xi32>
%result = linalg.matmul ins(%lhs, %rhs: tensor<64x16xi32>, tensor<16x64xi32>) outs(%acc: tensor<64x64xi32>) -> tensor<64x64xi32>
return %result: tensor<64x64xi32>
}
i.e. a matmul where the initial buffer is filled with '7' (C = 7 + A @ B). compiling this through iree-amd-aie and then running iree-run-module (before the fix in this PR) as follows:
iree-run-module --module=success.vmfb --device=xrt --input="64x16xi32=1" --input="16x64xi32=2" --expected_output="64x64xi32=39"
produces about 50% successful runs (for me locally). The syncs printed (see branch https://github.com/newling/iree-amd-aie/tree/intermittent_numerics_debugging) are as follows in the case of successful runs:
Syncing instr buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515692544
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515692544
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515700736
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94100515700736
EXEC @matmul_64x64_32xi32_
Binding count: 3
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94100515893248
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94100515893248
[SUCCESS] all function outputs matched their expected values.
And about 50% of the time the run fails. The logs when it fails look like:
Syncing instr buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848805888
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848805888
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848814080
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 4096
This buffer is at address 94816848814080
EXEC @matmul_64x64_32xi32_
Binding count: 3
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592
Syncing buffer from device (XCL_BO_SYNC_BO_FROM_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592
Syncing buffer to device (XCL_BO_SYNC_BO_TO_DEVICE)
This is a buffer of size 16384
This buffer is at address 94816849006592
[FAILED] result[0]: element at index 4080 (0) does not match the expected (39); expected that the view is equal to contents of a view of 64x64xi32
expected:
64x64xi32=[39 39 39 39 39 39 39 39 39 39 ...
Comparing the syncs printed in the successful and failing cases, one curious difference is present: when it fails, there seem to be 3 syncs of C after the run (Note that C is the buffer of size 16384). When it passes, there is only one one sync.
If we run the command
iree-run-module --module=success.vmfb --device=xrt --input="64x16xi32=1" --input="16x64xi32=2" --output_max_element_count=99999
So that we can see the values when there is an error, we see:
39 39 39 39 39 39 39 39 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
i.e. there are 0's instead of 39's.
This makes us think that IREE is writing 0's to C after the AIE has written the result (39) back from device. There is a PR illustrating this theory over here: #224 In that PR, we see what happens manually in the cases where we
- Either do or don't write to the XRT DDR buffer for C, and then
- Either do or don't sync (calling sync(XCL_BO_SYNC_BO_TO_DEVICE)) so ensuring that that CPU cache and DDR agree.
In the case where we 1: do write to XRT and then 2: do not sync, we reproduce the iree-amd-aie bug which this current PR aims to fix.
Working on the assumption that IREE is doing a memset to 0 of size 6464sizeof(int) bytes to the XRT DDR buffer, where is this happening? At this point I have unofortunately hit a brick wall. I have inserted 100's of print statements in the IREE runtime but cannot narrow down where such a memset is happening.
Note that such a memset is not necessary for correctness: The matmul is not an accumulation, and so the initial value of the the C buffer can safely be garbage. Ideally IREE would realize that and not set C to zero.
Supposing that IREE cannot not (that is, IREE must) initialize with zeros. Is there somewhere in IREE where we can call iree_hal_buffer_invalidate_range and/or iree_hal_buffer_flush_range? These are the functions where we call the syncs. Here too I have unfortunately hit a wall trying to decipher the runtime.
This is not urgent but what I want an expert (@benvanik, @antiagainst) opinion on at some point is
-
Where is C being 0 initialized in the HAL runtime, and can this be avoided?
-
Where is iree_hal_xrt_direct_command_buffer_dispatch:
being called from in the iree runtime, and would it make sense, for that caller, to call iree_hal_buffer_invalidate_range and/or iree_hal_buffer_flush_range as well?
See also the comment: #218 (comment)
@benvanik and @antiagainst this is the numerical issue I mentioned a few weeks ago (during the GPU numerical and performance push week). I was hoping to get your input on it.
I'm not sure I fully follow but it sounds like there's some basic questions that need answering before you can triage further. There's only one way we memset: iree_hal_command_buffer_fill_buffer
- and we only do that if the program generated by the compiler tells us to with a hal.command_buffer.fill_buffer
op. You can pass --compile-to=hal
to iree-compile and then look at the IR, or set a breakpoint in the C function at runtime, or use tracy to see what's being called. There's no magic: if you don't spot a fill buffer command then there is no fill happening outside of whatever may be within the dispatch. Once you've narrowed that down if it is inside the dispatch (which I strongly believe it will be in this case) then that's a codegen issue entirely within the generated executable and not something the IREE runtime has involvement in.
FWIW, fill_buffer is not yet implemented by the XRT HAL
iree-amd-aie/runtime/src/iree-amd-aie/driver/xrt/direct_command_buffer.cc
Lines 190 to 197 in 4695d7a
so it makes me wonder if the reason for the bug is something different all together (like a driver issue) and not the sync needed after a memset theory, James and I are doing some more investigation on it.
Useful information, thanks @benvanik. So it does seem like this is a driver/xrt issue then. @nirvedhmeshram please let me know if you found anything else while I was away, else I guess we can close this issue (seems like the driver on Dave's machine was updated).