tensorflow/mlir-hlo

lhlo-copy-removal pass crash

bondhugula opened this issue · 5 comments

I'm not sure whether issues can be posted on this repo. If not, I can move it to tensorflow proper.

This can be reproduced with a recent commit (d4dcba1340f363762cc6003d4ed1f4db2df61858) and in all certainty with the trunk as well. The input isn't really the expected one for this pass, but this is a bug apparently stemming from an assumption on the input. A check / bail out would have been fine for example.

Input:

func @func_op_long(%arg0: memref<4xf32>, %arg1: memref<4xf32>, %arg2: memref<4xf32>) {
    %0 = alloc() : memref<4xf32>
    affine.for %arg3 = 0 to 4 {
      %5 = affine.load %arg0[%arg3] : memref<4xf32>
      %6 = affine.load %arg1[%arg3] : memref<4xf32>
      %7 = cmpf "ogt", %5, %6 : f32
      %8 = select %7, %5, %6 : f32
      affine.store %8, %0[%arg3] : memref<4xf32>
    }
    "lmhlo.copy"(%0, %arg2) : (memref<4xf32>, memref<4xf32>) -> ()
    return
  }
$ mlir-hlo-opt -lhlo-copy-removal   /tmp/crash.mlir 
mlir-hlo-opt: external/llvm-project/mlir/lib/IR/Operation.cpp:330: bool mlir::Operation::isBeforeInBlock(mlir::Operation*): Assertion `other && other->block == block && "Expected other operation to have the same parent block."' failed.
PLEASE submit a bug report to  and include the crash backtrace.
Stack dump:
0.	Program arguments: bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt -lhlo-copy-removal /tmp/crash.mlir 
 #0 0x00000000014c1d7d llvm::sys::PrintStackTrace(llvm::raw_ostream&) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x14c1d7d)
 #1 0x00000000014bfaed llvm::sys::RunSignalHandlers() (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x14bfaed)
 #2 0x00000000014c041d SignalHandler(int) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x14c041d)
 #3 0x00007ff0c1bfcdd0 __restore_rt (/lib64/libpthread.so.0+0x12dd0)
 #4 0x00007ff0c164770f raise (/lib64/libc.so.6+0x3770f)
 #5 0x00007ff0c1631b25 abort (/lib64/libc.so.6+0x21b25)
 #6 0x00007ff0c16319f9 _nl_load_domain.cold.0 (/lib64/libc.so.6+0x219f9)
 #7 0x00007ff0c163fcc6 (/lib64/libc.so.6+0x2fcc6)
 #8 0x00000000014526f3 mlir::Operation::isBeforeInBlock(mlir::Operation*) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x14526f3)
 #9 0x00000000009f0acf _ZN4llvm12function_refIFvPN4mlir9OperationEEE11callback_fnIZNS1_6detail14walkOperationsIZNS1_5lmhlo12_GLOBAL__N_119LhloCopyRemovalPass14runOnOperationEvEUlNS9_6CopyOpEE_SC_vEENSt9enable_ifIXaantsrSt7is_sameIT0_S3_E5valuesrSF_IT1_vE5valueESI_E4typeES3_OT_EUlS3_E_EEvlS3_ (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x9f0acf)
#10 0x00000000014821e7 mlir::detail::walkOperations(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x14821e7)
#11 0x00000000014821e7 mlir::detail::walkOperations(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x14821e7)
#12 0x00000000009f073f mlir::lmhlo::(anonymous namespace)::LhloCopyRemovalPass::runOnOperation() (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x9f073f)
#13 0x00000000013d37ce mlir::Pass::run(mlir::Operation*, mlir::AnalysisManager) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x13d37ce)
#14 0x00000000013d38ba mlir::OpPassManager::run(mlir::Operation*, mlir::AnalysisManager) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x13d38ba)
#15 0x00000000013da139 mlir::PassManager::run(mlir::ModuleOp) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x13da139)
#16 0x0000000000c40320 performActions(llvm::raw_ostream&, bool, bool, llvm::SourceMgr&, mlir::MLIRContext*, mlir::PassPipelineCLParser const&) (.constprop.101) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0xc40320)
#17 0x0000000000c40b89 processBuffer(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, bool, bool, bool, bool, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0xc40b89)
#18 0x0000000000c40cd0 mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&, bool, bool, bool, bool, bool) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0xc40cd0)
#19 0x0000000000c4149d mlir::MlirOptMain(int, char**, llvm::StringRef, mlir::DialectRegistry&, bool) (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0xc4149d)
#20 0x000000000096a885 main (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x96a885)
#21 0x00007ff0c16336a3 __libc_start_main (/lib64/libc.so.6+0x236a3)
#22 0x000000000096402e _start (bazel-bin/tensorflow/compiler/mlir/hlo/mlir-hlo-opt+0x96402e)
Aborted (core dumped)

@dfki-ehna, @joker-eph

Thanks, Uday. This issue is going to be fixed in general mlir::CopyRemoval pass and it's going to be downstream to Tensorflow soon.

Thanks, Uday. This issue is going to be fixed in general mlir::CopyRemoval pass and it's going to be downstream to Tensorflow soon.

@dfki-ehna I happened to try out the upstream -copy-removal pass (which is btw great to be working just using the copy op interface!) on an lmhlo dialect snippet, and I noticed this missed opportunity (looks like a regression from the earlier pass). I didn't dig further to double-check but looks like the copy should have been eliminated here. I can file an issue if that's the case.

tf-opt -copy-removal test.mlir doesn't remove it.

func @must_be_removed_first(%arg0: memref<2x2xf32>, %arg1: memref<2x2xf32>, %arg2: memref<2x2xf32>) {
    %0 = alloc() {temp = true} : memref<2x2xf32>
    "lmhlo.exponential"(%arg1, %arg2) : (memref<2x2xf32>, memref<2x2xf32>) -> ()
    "lmhlo.exponential"(%arg0, %0) : (memref<2x2xf32>, memref<2x2xf32>) -> ()
    "lmhlo.copy"(%0, %arg2) : (memref<2x2xf32>, memref<2x2xf32>) -> ()
    dealloc %0 : memref<2x2xf32>
    "lmhlo.terminator"() : () -> ()
  }

@bondhugula Thanks for your comment. Actually, in this case, despite the fact that Copy should be removed but copy-removal is not able to remove Copy. Since we didn't have alias analysis nor information on whether the arguments are output buffers or inputs, we tried to be conservative in the copy removal. So, please consider this case where Copy removal makes the logic of the program wrong:

%temp = alloc()
exp(%arg0, %temp)
exp(%arg1, %arg2)
copy(%temp, %arg1)
dealloc %temp

Here, we cannot get rid of temp, since the content of %arg1 would then change for the second exp. Therefore, in the implementation of copy removal for the case "reusing target of the copy as source", we have this line of code for pruning:

if(hasUsersBetween(target, sourceDefiningOp, copy))
   return;

which means if the target value has any users between source defining and copy operations, this removal should not occur. However, the constraint such as this would soon be removed from Copy-Removal by implementing better analysis for it.