nod-ai/iree-amd-aie

[RFC] Peeled matmul on MLIR-AIE

jtuyls opened this issue · 3 comments

The goal of this RFC is to discuss how peeled matmul IR can be lowered to AIE code. cc @MaheshRavishankar @erwei-xilinx @nirvedhmeshram @yzhang93 @Abhishek-Varma

A few notes before we start:

  • This RFC uses objectfifos to represent AIE DMA operations, which is not currently supported through AIR (WIP: Xilinx/mlir-air#439). However, this is just a convenient representation for AIE DMA code used in the MLIR-AIE reference designs; other representations could be used as well to accomplish the same results.
  • The RFC omits many of the details on how to get to objectfifos through AIR, but after inspecting the IR transformations, one should be confident that this can be accomplished. Also, large parts of this should already exist (see the PR linked above).

RFC

To start, here is some sample peeled matmul IR:

...
scf.forall (%arg4, %arg5) in (1, 2) {
  %5 = affine.apply #map(%arg4)
  %6 = affine.apply #map(%arg5)
  %subview_7 = memref.subview %alloc_4[%5, 0] [32, 64] [1, 1] : memref<32x64xi32, 1> to memref<32x64xi32, strided<[64, 1], offset: ?>, 1>
  %subview_8 = memref.subview %alloc_3[0, %6] [64, 32] [1, 1] : memref<64x64xi32, 1> to memref<64x32xi32, strided<[64, 1], offset: ?>, 1>
  %subview_9 = memref.subview %alloc_2[%5, %6] [32, 32] [1, 1] : memref<32x64xi32, 1> to memref<32x32xi32, strided<[64, 1], offset: ?>, 1>
  linalg.fill ins(%c0_i32 : i32) outs(%alloc_1 : memref<4x8x4x8xi32, 2>)
  iree_linalg_ext.pack %subview_7 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_0 : (memref<32x64xi32, strided<[64, 1], offset: ?>, 1> memref<8x8x4x8xi32, 2>)
  iree_linalg_ext.pack %subview_8 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %alloc : (memref<64x32xi32, strided<[64, 1], offset: ?>, 1> memref<4x8x8x8xi32, 2>)
  linalg.generic {indexing_maps = [#map2, #map3, #map4], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%alloc_0, %alloc : memref<8x8x4x8xi32, 2>, memref<4x8x8x8xi32, 2>) outs(%alloc_1 : memref<4x8x4x8xi32, 2>) {
  ^bb0(%in: i32, %in_10: i32, %out: i32):
    %7 = arith.muli %in, %in_10 : i32
    %8 = arith.addi %out, %7 : i32
    linalg.yield %8 : i32
  }
  iree_linalg_ext.unpack %alloc_1 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %subview_9 : (memref<4x8x4x8xi32, 2> memref<32x32xi32, strided<[64, 1], offset: ?>, 1>)
} {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
linalg.copy ins(%alloc_2 : memref<32x64xi32, 1>) outs(%subview : memref<32x64xi32, strided<[64, 1], offset: ?>>)
scf.for %arg4 = %c64 to %c1024 step %c64 {
  %subview_7 = memref.subview %1[0, %arg4] [32, 64] [1, 1] : memref<32x1024xi32, strided<[?, ?], offset: ?>> to memref<32x64xi32, strided<[?, ?], offset: ?>>
  %subview_8 = memref.subview %0[%arg4, 0] [64, 64] [1, 1] : memref<1024x64xi32, strided<[?, ?], offset: ?>> to memref<64x64xi32, strided<[?, ?], offset: ?>>
  linalg.copy ins(%subview_7 : memref<32x64xi32, strided<[?, ?], offset: ?>>) outs(%alloc_4 : memref<32x64xi32, 1>)
  linalg.copy ins(%subview_8 : memref<64x64xi32, strided<[?, ?], offset: ?>>) outs(%alloc_3 : memref<64x64xi32, 1>)
  linalg.copy ins(%subview : memref<32x64xi32, strided<[64, 1], offset: ?>>) outs(%alloc_2 : memref<32x64xi32, 1>)
  scf.forall (%arg5, %arg6) in (1, 2) {
    %5 = affine.apply #map(%arg5)
    %6 = affine.apply #map(%arg6)
    %subview_9 = memref.subview %alloc_4[%5, 0] [32, 64] [1, 1] : memref<32x64xi32, 1> to memref<32x64xi32, strided<[64, 1], offset: ?>, 1>
    %subview_10 = memref.subview %alloc_3[0, %6] [64, 32] [1, 1] : memref<64x64xi32, 1> to memref<64x32xi32, strided<[64, 1], offset: ?>, 1>
    %subview_11 = memref.subview %alloc_2[%5, %6] [32, 32] [1, 1] : memref<32x64xi32, 1> to memref<32x32xi32, strided<[64, 1], offset: ?>, 1>
    iree_linalg_ext.pack %subview_11 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_1 : (memref<32x32xi32, strided<[64, 1], offset: ?>, 1> memref<4x8x4x8xi32, 2>)
    iree_linalg_ext.pack %subview_9 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %alloc_0 : (memref<32x64xi32, strided<[64, 1], offset: ?>, 1> memref<8x8x4x8xi32, 2>)
    iree_linalg_ext.pack %subview_10 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %alloc : (memref<64x32xi32, strided<[64, 1], offset: ?>, 1> memref<4x8x8x8xi32, 2>)
    linalg.generic {indexing_maps = [#map2, #map3, #map4], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%alloc_0, %alloc : memref<8x8x4x8xi32, 2>, memref<4x8x8x8xi32, 2>) outs(%alloc_1 : memref<4x8x4x8xi32, 2>) {
    ^bb0(%in: i32, %in_12: i32, %out: i32):
      %7 = arith.muli %in, %in_12 : i32
      %8 = arith.addi %out, %7 : i32
      linalg.yield %8 : i32
    }
    iree_linalg_ext.unpack %alloc_1 outer_dims_perm = [1, 0] inner_dims_pos = [0, 1] inner_tiles = [4, 8] into %subview_11 : (memref<4x8x4x8xi32, 2> memref<32x32xi32, strided<[64, 1], offset: ?>, 1>)
  } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
  linalg.copy ins(%alloc_2 : memref<32x64xi32, 1>) outs(%subview : memref<32x64xi32, strided<[64, 1], offset: ?>>)
}
...

Issues with lowering peeled matmul IR to AIE code:

  1. Peeling creates multiple regions of AIE core code with data movement in between: 1) the fill plus the first iteration of the reduction loop, and 2) the remaining iterations of the reduction loop. However, only a single ELF can currently be loaded per core.
  2. Peeling inserts duplicated copies and packs representing data movement on AIE. You can see this in the snippet above: the output of the first region (fill + matmul) is unpacked from L1 to L2 and copied from L2 to L3, before being moved back into L2 and L1 through subsequent copy and pack operations. This creates a couple of sub-issues: 1) it's inefficient, especially on AIE, and 2) it's harder to generate (efficient) control code for. This is the case because, on AIE, the peeled copies/packs can be executed together with the unpeeled ones through a single set of DMA instructions (see the sketch below). So, on the one hand, we introduce peeling to accomplish fusion at L1 (fill + matmul), but afterwards we need to recover the information that the peeled and unpeeled data movement can be executed as a single DMA instruction. Additionally, the data movement part of the explicitly peeled IR seems to be unexecutable right now as is, because a state machine can't always be created for it with limited resources AND not all DMAs can be reprogrammed in the current mlir-aie flow (the memtile and core DMAs are programmed once per xclbin load).
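
To make the "single set of DMA instructions" point concrete: the peeled copy of the A operand into %alloc_4 (in the elided prologue above) and the in-loop copy at the top of the scf.for body differ only in their K offset (0 vs. 64 through 960). The hand-written sketch below (not compiler output) shows the un-peeled form of that data movement, which a single DMA program on AIE can implement directly:

scf.for %k = %c0 to %c1024 step %c64 {
  // Same subview/copy pattern as in the peeled IR above, now covering all 16
  // K-slices, including the one that peeling splits off.
  %a_slice = memref.subview %1[0, %k] [32, 64] [1, 1] : memref<32x1024xi32, strided<[?, ?], offset: ?>> to memref<32x64xi32, strided<[?, ?], offset: ?>>
  linalg.copy ins(%a_slice : memref<32x64xi32, strided<[?, ?], offset: ?>>) outs(%alloc_4 : memref<32x64xi32, 1>)
}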

Now, looking at the AIE core side, peeled core code can be accomplished as shown in the snippet below. The goal is to get to something like this on the AIE core side, starting from the higher-level peeled matmul IR above.

%core_0_2 = aie.core(%tile_0_2) {
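  // Peeled prologue: acquire the output and the first input slices, zero the accumulator, and run the first reduction iteration.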
  %0 = aie.objectfifo.acquire @outC0(Produce, 1) : !aie.objectfifosubview<memref<32xf32>>
  %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<32xf32>> -> memref<32xf32>
  %2 = aie.objectfifo.acquire @inA0(Consume, 1) : !aie.objectfifosubview<memref<32x32xbf16>>
  %3 = aie.objectfifo.subview.access %2[0] : !aie.objectfifosubview<memref<32x32xbf16>> -> memref<32x32xbf16>
  %4 = aie.objectfifo.acquire @inB(Consume, 1) : !aie.objectfifosubview<memref<32xbf16>>
  %5 = aie.objectfifo.subview.access %4[0] : !aie.objectfifosubview<memref<32xbf16>> -> memref<32xbf16>
  func.call @zero_vectorized_f32(%1) : (memref<32xf32>) -> ()
  func.call @matvec_vectorized_bf16_f32(%3, %5, %1) : (memref<32x32xbf16>, memref<32xbf16>, memref<32xf32>) -> ()
  aie.objectfifo.release @inA0(Consume, 1)
  aie.objectfifo.release @inB(Consume, 1)
  %c0_0 = arith.constant 0 : index
  %c1_1 = arith.constant 1 : index
  %c1_2 = arith.constant 1 : index
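  // Remaining reduction iterations: accumulate into the same output slice %1 without re-acquiring @outC0.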
  scf.for %arg1 = %c0_0 to %c1_1 step %c1_2 {
    %6 = aie.objectfifo.acquire @inA0(Consume, 1) : !aie.objectfifosubview<memref<32x32xbf16>>
    %7 = aie.objectfifo.subview.access %6[0] : !aie.objectfifosubview<memref<32x32xbf16>> -> memref<32x32xbf16>
    %8 = aie.objectfifo.acquire @inB(Consume, 1) : !aie.objectfifosubview<memref<32xbf16>>
    %9 = aie.objectfifo.subview.access %8[0] : !aie.objectfifosubview<memref<32xbf16>> -> memref<32xbf16>
    func.call @matvec_vectorized_bf16_f32(%7, %9, %1) : (memref<32x32xbf16>, memref<32xbf16>, memref<32xf32>) -> ()
    aie.objectfifo.release @inA0(Consume, 1)
    aie.objectfifo.release @inB(Consume, 1)
  }
  aie.objectfifo.release @outC0(Produce, 1)
  aie.end
} {link_with = "mv.o"}
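
For reference, the objectfifo symbols used in this core would be declared at the aie.device level roughly as follows. This is a sketch based on the MLIR-AIE reference designs; the device name, tile coordinates, and buffer depths are illustrative assumptions, not taken from the snippet above:

aie.device(npu1_1col) {  // device name is a placeholder
  %tile_0_1 = aie.tile(0, 1)  // memtile (L2)
  %tile_0_2 = aie.tile(0, 2)  // compute tile (L1)
  // Double-buffered inputs, single-buffered accumulator output; the DMA and
  // lock programming is derived from these declarations.
  aie.objectfifo @inA0(%tile_0_1, {%tile_0_2}, 2 : i32) : !aie.objectfifo<memref<32x32xbf16>>
  aie.objectfifo @inB(%tile_0_1, {%tile_0_2}, 2 : i32) : !aie.objectfifo<memref<32xbf16>>
  aie.objectfifo @outC0(%tile_0_2, {%tile_0_1}, 1 : i32) : !aie.objectfifo<memref<32xf32>>
  // ... %core_0_2 = aie.core(%tile_0_2) { ... } from the snippet above goes here ...
}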

Options

  1. Lower all regions to logical objectfifo DMA state machines and, if applicable, superimpose the state machines to be able to generate (efficient) AIE control code.
  2. Lower the IR to an objectfifo DMA state machine before doing peeling (and possibly other transformations). This assumes that no transformation after the objectfifo DMA state machine generation changes the DMA configurations (peeling, for example, shouldn't). However, this is a big assumption and seems hard to guarantee.

For now, only option 1 is worked out for discussion:

Option 1 (Superimpose region DMA state machines)

This option can be accomplished through the following steps:

  1. Lower to AIR (BridgeToAir, PackToDma, CopyToDma, etc.).
  2. From the DMAs, derive objectfifo state machines for each region (DmaToObjectFifo).
  3. A transformation to bring outer loops (e.g. the reduction loop) into the AIE control code. This avoids regions being in different scopes and should help with performance as well in the long run (AieBringInLoopsIntoAieControlCode).
  4. A transformation to bring outer loops into the AIE core code to avoid multiple code blocks that would each need a reload of the AIE cores' 'main' (AieBringInLoopsIntoAieCoreCode).
  5. Try superimposing the objectfifo DMA state machines from different regions (AieSuperimposeStateMachines); see the sketch below.
  6. Try combining the AIE core code into a single code block (to achieve a single ELF) to avoid reloading 'main' (AieCombineCoreCode).
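
As a rough illustration of step 5 (hand-written, not actual pass output; symbol names are hypothetical): each region would derive its own objectfifo for, e.g., the L2-to-L1 movement of the A operand:

// Region 1 (fill + first reduction iteration):
aie.objectfifo @inA_r0(%tile_0_1, {%tile_0_2}, 1 : i32) : !aie.objectfifo<memref<8x8x4x8xi32>>
// Region 2 (remaining reduction iterations):
aie.objectfifo @inA_r1(%tile_0_1, {%tile_0_2}, 2 : i32) : !aie.objectfifo<memref<8x8x4x8xi32>>

Superimposing would check that both state machines have the same endpoints, element type, and access pattern, and replace them with a single objectfifo (e.g. taking the maximum depth) whose acquires and releases are shared by the peeled and unpeeled code:

aie.objectfifo @inA(%tile_0_1, {%tile_0_2}, 2 : i32) : !aie.objectfifo<memref<8x8x4x8xi32>>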

The transformations above warrant a demonstration of the conceptual lowering, which you can find here: https://gist.github.com/jtuyls/7e6a41619666fa3186b1a8156978eedc

Option 2 (Lower to objectfifo DMA state machine before peeling)

Not worked out for now.

Thanks for the detailed description of the plan. In MLIR-AIR there is a pass, air-fuse-channels, which looks at the sending and receiving ends of channels and attempts to fuse them, so that fused channels get lowered to the same set of hardware resources.

I wonder if the pass might be useful in materializing your plan when we go from pack/copy -> air.dma -> (async) air.channel -> (fused) air.channel -> aie.objectFifo.

Here are some CI tests showing how the pass works: https://github.com/Xilinx/mlir-air/blob/main/mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir
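
For illustration (hand-written, not taken from those tests; channel names are hypothetical, and whether the pass covers this exact pattern would need checking): the peeled and in-loop transfers of the A operand could start out on two channels,

air.channel @chan_a_peel [1, 1]
air.channel @chan_a_loop [1, 1]
// %a_slice_0 / %a_slice are the K=0 and K>0 subviews of A (subview computation elided).
air.channel.put @chan_a_peel[] (%a_slice_0[] [] []) : (memref<32x64xi32, strided<[?, ?], offset: ?>>)
scf.for %k = %c64 to %c1024 step %c64 {
  air.channel.put @chan_a_loop[] (%a_slice[] [] []) : (memref<32x64xi32, strided<[?, ?], offset: ?>>)
}

and fusing them into a single channel would mean the peeled and unpeeled traffic share one set of DMA resources, which is exactly the information issue 2 of the RFC asks to recover.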

I experimented with option 1 above; here is an updated gist with partially lowered IR, FYI: https://gist.github.com/jtuyls/adafd09f9fc4ac8e2a85e4e3b2a4aead @nirvedhmeshram @yzhang93

Closing this as the objectFifo lowering pipeline has now been added with the following (main) PRs:

  1. #267
  2. #280
  3. #302
  4. #314
  5. #314
  6. #343
  7. #348
  8. #355
  9. #357
  10. #396
  11. #406
  12. #413
  13. #457
  14. #473