nod-ai/iree-amd-aie

matmul-elementwise bf16 model failed compilation

yzhang93 opened this issue · 6 comments

Input IR

!lhs = tensor<1024x512xbf16>
!rhs = tensor<512x1024xbf16>
!ele = tensor<1024x1024xf32>
!res = tensor<1024x1024xbf16>

func.func @matmul_elementwise_bf16(%lhs : !lhs, %rhs : !rhs, %ele : !ele) -> !res {
  %cst = arith.constant 0.0 : f32
  %0 = tensor.empty() : !ele
  %1 = tensor.empty() : !res
  %fill = linalg.fill ins(%cst : f32) outs(%0 : !ele) -> !ele
  %2 = linalg.matmul ins(%lhs, %rhs : !lhs, !rhs) outs(%fill : !ele) -> !ele
  %res = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%2, %ele : !ele, !ele) outs(%1 : !res) {
  ^bb0(%in: f32, %in_0: f32, %out: bf16):
    %11 = arith.addf %in, %in_0 : f32
    %12 = arith.truncf %11 : f32 to bf16
    linalg.yield %12 : bf16
  } -> !res
  return %res : !res
}

Error:

LLVM ERROR: unable to legalize instruction: %1730:_(<1024 x s16>) = G_SHUFFLE_VECTOR %1729:_(<1024 x s16>), %1475:_, shufflemaskin function: core_0_2)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /proj/xsjhdstaff4/vivizhan/llvm-aie/install/bin/llc /proj/xsjhdstaff4/vivizhan/iree-amd-aie/build_tools/ci/cpu_comparison/test_result_bf16/module_matmul_elementwise_bf16_dispatch_0_amdaie_xclbin_fb/input.opt.ll -O2 --march=aie2 --function-sections --filetype=obj -o /proj/xsjhdstaff4/vivizhan/iree-amd-aie/build_tools/ci/cpu_comparison/test_result_bf16/module_matmul_elementwise_bf16_dispatch_0_amdaie_xclbin_fb/input.o
1.	Running pass 'Function Pass Manager' on module '/proj/xsjhdstaff4/vivizhan/iree-amd-aie/build_tools/ci/cpu_comparison/test_result_bf16/module_matmul_elementwise_bf16_dispatch_0_amdaie_xclbin_fb/input.opt.ll'.
2.	Running pass 'Legalizer' on function '@core_0_2'
 #0 0x000055ae9b6ceebf llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/Support/Unix/Signals.inc:567:22
 #1 0x000055ae9b6ccfc4 llvm::sys::RunSignalHandlers() /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/Support/Signals.cpp:104:20
 #2 0x000055ae9b6cd146 SignalHandler(int) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/Support/Unix/Signals.inc:412:1
 #3 0x00007fa6da842520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007fa6da8969fc __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007fa6da8969fc __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #6 0x00007fa6da8969fc pthread_kill ./nptl/pthread_kill.c:89:10
 #7 0x00007fa6da842476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #8 0x00007fa6da8287f3 abort ./stdlib/abort.c:81:7
 #9 0x000055ae9b6438d3 (/proj/xsjhdstaff4/vivizhan/llvm-aie/install/bin/llc+0x2cd98d3)
#10 0x000055ae9bb25532 reportGISelDiagnostic(llvm::DiagnosticSeverity, llvm::MachineFunction&, llvm::TargetPassConfig const&, llvm::MachineOptimizationRemarkEmitter&, llvm::MachineOptimizationRemarkMissed&) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/CodeGen/GlobalISel/Utils.cpp:257:23
#11 0x000055ae9bb26f5b llvm::DiagnosticInfoOptimizationBase::~DiagnosticInfoOptimizationBase() /proj/rdi/staff/vivizhan/llvm-aie/llvm/include/llvm/IR/DiagnosticInfo.h:413:7
#12 0x000055ae9bb26f5b llvm::DiagnosticInfoMIROptimization::~DiagnosticInfoMIROptimization() /proj/rdi/staff/vivizhan/llvm-aie/llvm/include/llvm/CodeGen/MachineOptimizationRemarkEmitter.h:30:7
#13 0x000055ae9bb26f5b llvm::MachineOptimizationRemarkMissed::~MachineOptimizationRemarkMissed() /proj/rdi/staff/vivizhan/llvm-aie/llvm/include/llvm/CodeGen/MachineOptimizationRemarkEmitter.h:84:7
#14 0x000055ae9bb26f5b llvm::reportGISelFailure(llvm::MachineFunction&, llvm::TargetPassConfig const&, llvm::MachineOptimizationRemarkEmitter&, char const*, llvm::StringRef, llvm::MachineInstr const&) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/CodeGen/GlobalISel/Utils.cpp:286:1
#15 0x000055ae9babdb82 llvm::Legalizer::runOnMachineFunction(llvm::MachineFunction&) (.part.0) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/CodeGen/GlobalISel/Legalizer.cpp:348:12
#16 0x000055ae9a7f9b3b llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (.part.0) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/CodeGen/MachineFunctionPass.cpp:91:33
#17 0x000055ae9ad2eaec llvm::FPPassManager::runOnFunction(llvm::Function&) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/IR/LegacyPassManager.cpp:1440:7
#18 0x000055ae9ad2ed19 llvm::ilist_node_base<true>::getNext() const /proj/rdi/staff/vivizhan/llvm-aie/llvm/include/llvm/ADT/ilist_node_base.h:43:45
#19 0x000055ae9ad2ed19 llvm::ilist_node_impl<llvm::ilist_detail::node_options<llvm::Function, true, false, void>>::getNext() /proj/rdi/staff/vivizhan/llvm-aie/llvm/include/llvm/ADT/ilist_node.h:67:66
#20 0x000055ae9ad2ed19 llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Function, true, false, void>, false, false>::operator++() /proj/rdi/staff/vivizhan/llvm-aie/llvm/include/llvm/ADT/ilist_iterator.h:157:25
#21 0x000055ae9ad2ed19 llvm::FPPassManager::runOnModule(llvm::Module&) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/IR/LegacyPassManager.cpp:1475:22
#22 0x000055ae9ad2f59e runOnModule /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/IR/LegacyPassManager.cpp:1552:7
#23 0x000055ae9ad2f59e llvm::legacy::PassManagerImpl::run(llvm::Module&) /proj/rdi/staff/vivizhan/llvm-aie/llvm/lib/IR/LegacyPassManager.cpp:535:55
#24 0x000055ae99e4601e compileModule(char**, llvm::LLVMContext&) /proj/rdi/staff/vivizhan/llvm-aie/llvm/tools/llc/llc.cpp:736:66
#25 0x000055ae99e46f86 main /proj/rdi/staff/vivizhan/llvm-aie/llvm/tools/llc/llc.cpp:420:35
#26 0x00007fa6da829d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#27 0x00007fa6da829e40 call_init ./csu/../csu/libc-start.c:128:20
#28 0x00007fa6da829e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#29 0x000055ae99e3a2e5 _start (/proj/xsjhdstaff4/vivizhan/llvm-aie/install/bin/llc+0x14d02e5)

In contrast, the bf16-f32 model below (without arith.truncf %11 : f32 to bf16) does not hit this error.

!lhs = tensor<1024x512xbf16>
!rhs = tensor<512x1024xbf16>
!ele = tensor<1024x1024xf32>
!res = tensor<1024x1024xf32>

func.func @matmul_elementwise_bf16(%lhs : !lhs, %rhs : !rhs, %ele : !ele) -> !res {
  %cst = arith.constant 0.0 : f32
  %0 = tensor.empty() : !ele
  %fill = linalg.fill ins(%cst : f32) outs(%0 : !ele) -> !ele
  %2 = linalg.matmul ins(%lhs, %rhs : !lhs, !rhs) outs(%fill : !ele) -> !ele
  %res = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%2, %ele : !ele, !ele) outs(%0 : !ele) {
  ^bb0(%in: f32, %in_0: f32, %out: f32):
    %11 = arith.addf %in, %in_0 : f32
    linalg.yield %11 : f32
  } -> !res
  return %res : !res
}

@MaheshRavishankar @stephenneuendorffer @newling @erwei-xilinx Any insight about the issue?

I don't know if Peano handles bf16 natively.

I believe there's work going on to implement shuffle_vector. Currently the assumption is that vector ops always go through intrinsics. FYI, for Peano issues, you're better off capturing the .ll code and creating an issue in the Peano repo.

Peano does support bf16 types, and there is indeed work to support more and more cases of generic shuffle_vector. However, I think the problem here is rather that %1730:_(<1024 x s16>) is a huge vector, and we do not have the capability yet to properly legalize those. As Stephen said, it would be very useful if you could get us a small .ll reproducer, then we can investigate what's really happening here :)
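To put a number on "huge vector", here is a minimal sketch that reads the vector type out of the failing instruction and computes its total size. The helper name `vector_size_bits` is made up for illustration and is not part of Peano or LLVM; it only handles the plain `<N x sM>` spelling seen in the error above.

```python
import re


def vector_size_bits(llt: str) -> int:
    """Return the total size in bits of a type spelled like '<1024 x s16>'.

    Hypothetical helper for illustration only; it accepts just the
    '<lanes x sBITS>' form that appears in the error message.
    """
    m = re.fullmatch(r"<(\d+) x s(\d+)>", llt.strip())
    if m is None:
        raise ValueError(f"unrecognized vector type: {llt!r}")
    lanes, bits_per_lane = int(m.group(1)), int(m.group(2))
    return lanes * bits_per_lane


bits = vector_size_bits("<1024 x s16>")
print(f"{bits} bits = {bits // 8} bytes")
```

So the single value being shuffled is 1024 lanes of 16 bits, i.e. 2048 bytes, which is what the legalizer is being asked to handle in one instruction.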

Support for G_SHUFFLE_VECTOR in Peano will be up for review soon, so it should land soonish. The failing instruction operates on 16-bit elements, so the problem is not bf16 support in any case. There are two problems with the code as is:

  • The output vector is 4 times larger than the largest register we have available, and the code doesn't currently handle that. This is something that can be added to Peano, though; I just haven't gotten around to it yet.
  • In the generic case, G_SHUFFLE_VECTOR is incredibly slow, since it needs to extract each value of the vector in turn and then reconstruct the vector element by element. This instruction only changes the last 64 bytes, so it will do 32,640 bytes of useless memory operations. We can reduce this a lot by matching the patterns that you depend on and replacing them with better instructions, but that requires us to know which G_SHUFFLE_VECTOR masks are required.
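The second point can be sketched in a few lines of Python. This is only a model of shuffle semantics, not Peano's lowering: `generic_shuffle`, `identity_prefix_length` and `shuffle_with_tail_fastpath` are hypothetical names, and the mask is an assumed one in which only the last 32 lanes (64 bytes of 16-bit elements) come from the second operand, since the real mask is cut off in the error message.

```python
def generic_shuffle(v1, v2, mask):
    """Generic path: pick every result lane one by one from concat(v1, v2)."""
    combined = list(v1) + list(v2)
    return [combined[idx] for idx in mask]


def identity_prefix_length(mask):
    """Count the leading lanes where the mask just forwards lane i of v1."""
    n = 0
    while n < len(mask) and mask[n] == n:
        n += 1
    return n


def shuffle_with_tail_fastpath(v1, v2, mask):
    """Copy the untouched prefix in one go, then shuffle only the tail lanes."""
    prefix = identity_prefix_length(mask)
    combined = list(v1) + list(v2)
    return list(v1[:prefix]) + [combined[idx] for idx in mask[prefix:]]


LANES = 1024
TAIL = 32  # 64 bytes of 16-bit lanes; assumed layout, not the real mask

v1 = list(range(LANES))
v2 = [-1] * LANES
# Assumed mask: identity for the first 992 lanes, last 32 lanes taken from v2.
mask = list(range(LANES - TAIL)) + [LANES + i for i in range(TAIL)]

slow = generic_shuffle(v1, v2, mask)
fast = shuffle_with_tail_fastpath(v1, v2, mask)
print(identity_prefix_length(mask), slow == fast)
```

With a mask of this shape, the generic path touches all 1024 lanes while the matched path only has to rewrite the tail, which is why knowing the masks that actually occur matters.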

Thanks @stephenneuendorffer @gbossu @ValentijnvdBeek for looking into the issue! Here are the .ll files generated from the above example. Please let me know if you need me to provide other sources.
input_ll.zip