
Supporting matmul transpose variants

newling opened this issue · 0 comments

Support for all variants of matmul transpose

We should support all transpose-variants of matmul / batch_matmul / GEMM:
matmul(A, B), matmul(A, B.T), matmul(A.T, B), matmul(A.T, B.T)
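
As a reference for what the four variants compute, here is a minimal NumPy sketch. NumPy is used only for illustration; the helper name `matmul_variant`, its flags, and the sizes are assumptions and do not correspond to any existing API in this project.

```python
import numpy as np

def matmul_variant(a, b, transpose_a=False, transpose_b=False):
    """Reference semantics: optionally transpose each 2-D operand, then multiply."""
    lhs = a.T if transpose_a else a
    rhs = b.T if transpose_b else b
    return lhs @ rhs

rng = np.random.default_rng(0)
M, K, N = 4, 8, 4  # illustrative sizes only
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))

# Store each operand either as-is or pre-transposed; the flag undoes the
# stored transpose, so all four calls compute the same product A @ B.
A_t = np.ascontiguousarray(A.T)
B_t = np.ascontiguousarray(B.T)
results = [
    matmul_variant(A, B),                                       # matmul(A, B)
    matmul_variant(A, B_t, transpose_b=True),                   # matmul(A, B.T)
    matmul_variant(A_t, B, transpose_a=True),                   # matmul(A.T, B)
    matmul_variant(A_t, B_t, transpose_a=True, transpose_b=True),  # matmul(A.T, B.T)
]
```

The variants differ only in how the operands are laid out in memory, which is why the rest of this issue is about data movement rather than arithmetic.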

Only one of these variants has "native" support in the AIE core, i.e. the intrinsic expects a specific layout for A and B.

We therefore need to transpose the data explicitly: either before the matmul, on the fly in the DMAs, and/or in the core.

For matmul(A, B.T), on the core we want to perform a matmul with a chunk of A

[[ A00, A01, A02, A03, A04, A05, A06, A07],
[ A10, A11, A12, A13, A14, A15, A16, A17],
[ A20, A21, A22, A23, A24, A25, A26, A27],
[ A30, A31, A32, A33, A34, A35, A36, A37]]

and a chunk of B:

[[ B00, B01, B02, B03, B04, B05, B06, B07],
[ B10, B11, B12, B13, B14, B15, B16, B17],
[ B20, B21, B22, B23, B24, B25, B26, B27],
[ B30, B31, B32, B33, B34, B35, B36, B37]]

If we don't do any transposing of B in DMAs, the 32 values for the B matrix arrive on the core as

(i) [B00, B01, B02, B03, B04, B05, B06, B07, B10, B11, B12, B13, B14, B15, B16, B17, B20, B21, B22, B23, B24, B25, B26, B27, B30, B31, B32, B33, B34, B35, B36, B37]

For the bf16 matmul instruction on AIE2, I think the expected layout of the right-hand operand is not transposed, i.e. B.T must be stored row-major in memory:
(ii) [B00, B10, B20, B30, B01, B11, B21, B31, B02, B12, B22, B32, B03, B13, B23, B33, B04, B14, B24, B34, B05, B15, B25, B35, B06, B16, B26, B36, B07, B17, B27, B37]
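
To make the two orderings concrete, the following NumPy sketch reproduces (i) and (ii) from a 4x8 chunk of labelled elements. It only models element order, not the hardware.

```python
import numpy as np

ROWS, COLS = 4, 8
# Label each element "Brc" so the orders can be compared with (i) and (ii).
B = np.array([[f"B{r}{c}" for c in range(COLS)] for r in range(ROWS)])

# (i): row-major order of B, i.e. what arrives when the DMAs do no transposing.
layout_i = B.reshape(-1)

# (ii): row-major order of B.T, i.e. a full element-wise transpose.
layout_ii = B.T.reshape(-1)

print(" ".join(layout_i))
print(" ".join(layout_ii))
```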

(Note that this might change in future architectures, i.e. the matmul intrinsic might expect B to be in the transpose layout already). To be confirmed @erwei-xilinx

We cannot reach layout (ii) for bfloat16 because DMAs can't split 32-bit elements, so each pair of adjacent 16-bit values stays together (and we shouldn't even split 128-bit elements if we want to use the full DMA bandwidth).

So what is the best order in which to deliver B to the core, to minimize the rearrangement the core must do? For example, we could deliver B as

(iii) [B00, B01, B10, B11, B20, B21, B30, B31, B02, B03, B12, B13, B22, B23, B32, B33, B04, B05, B14, B15, B24, B25, B34, B35, B06, B07, B16, B17, B26, B27, B36, B37]
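
A small NumPy sketch of how (iii) relates to (ii), assuming the DMA moves 32-bit words (two bf16 values) as indivisible units. The constant `ELEMS_PER_WORD` and the variable names are illustrative; this models element order only and says nothing about which shuffle intrinsics would implement the last step.

```python
import numpy as np

ROWS, COLS = 4, 8
ELEMS_PER_WORD = 2  # assumption: two bf16 values per 32-bit DMA word
WORDS_PER_ROW = COLS // ELEMS_PER_WORD

B = np.array([[f"B{r}{c}" for c in range(COLS)] for r in range(ROWS)])

# View B as (row, word, element-in-word); the DMA may reorder whole words only.
words = B.reshape(ROWS, WORDS_PER_ROW, ELEMS_PER_WORD)

# (iii): transpose at word granularity (word index outermost, then row).
layout_iii = words.transpose(1, 0, 2).reshape(-1)

# What is left for the core: inside each group of ROWS words, swap the
# row axis with the element-in-word axis to obtain layout (ii).
layout_ii_from_iii = (
    layout_iii.reshape(WORDS_PER_ROW, ROWS, ELEMS_PER_WORD)
    .transpose(0, 2, 1)
    .reshape(-1)
)
```

Under this model, starting from (iii) leaves the core with four independent 4x2 transposes instead of one 4x8 transpose; whether that is actually cheaper would need measuring.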

But would that be better than delivering B as (i)? Currently we deliver it as (i), and aievec handles this by rearranging the B matrix in the core using a single transpose (shuffle) intrinsic. See . We haven't done any performance analysis of this approach. There might be a benefit to performing the transpose on larger tiles (see for example ), which could somehow amortize the cost of doing the transpose (cc @jsetoain).

A completely different approach would be to perform all necessary or beneficial transposes before the matmul. These transposes might be mergeable with the producers of the operands. TODO: find out if there is any support for this transformation in IREE.
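
To illustrate what merging a transpose with a producer could mean, here is a hedged NumPy sketch. It assumes the producer of B is a purely elementwise op (a hypothetical `relu` here), which commutes with transposition; it is not a statement about what IREE supports.

```python
import numpy as np

def relu(x):
    # Hypothetical elementwise producer of the B operand.
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
X = rng.standard_normal((4, 8))  # input to the producer of B

# Unmerged: produce B, then transpose it as a separate step for matmul(A, B.T).
B = relu(X)
unfused = A @ B.T

# Merged: because the producer is elementwise, it can emit its result
# directly in the transposed layout, so no standalone transpose remains.
B_t = relu(X.T)
fused = A @ B_t
```

For producers that are not elementwise (another matmul, a reduction), the transpose would have to be absorbed differently, for example by changing the producer's output indexing.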

We should probably support both approaches: before matmul AND during matmul.

cc @jtuyls thoughts?