iree-org/iree

(ROCM) Failed to distribute matmul in sdxl-turbo unet

monorimet opened this issue

What happened?

Module-level error:

<unknown>:0: error: cannot get concrete layout for contraction
<stdin>:2704:12: error: 'func.func' op failed to distribute
    %445 = torch.aten.convolution %338, %439, %440, %441, %442, %443, %false_465, %444, %int1_468 : !torch.vtensor<[1,320,64,64],f16>, !torch.vtensor<[640,320,1,1],f16>, !torch.vtensor<[640],f16>, !torch.list<int>, !torch.list<int>, !torch.list<int>, !torch.bool, !torch.list<int>, !torch.int -> !torch.vtensor<[1,640,64,64],f16>
           ^
<stdin>:1718:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[1,4,128,128],f16>, !torch.vtensor<[1,64,2048],f16>, !torch.vtensor<[1,1280],f16>, !torch.vtensor<[1,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1,4,128,128],f16>
         ^
<stdin>:2704:12: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>
    %445 = torch.aten.convolution %338, %439, %440, %441, %442, %443, %false_465, %444, %int1_468 : !torch.vtensor<[1,320,64,64],f16>, !torch.vtensor<[640,320,1,1],f16>, !torch.vtensor<[640],f16>, !torch.list<int>, !torch.list<int>, !torch.list<int>, !torch.bool, !torch.list<int>, !torch.int -> !torch.vtensor<[1,640,64,64],f16>
           ^
<stdin>:1718:10: note: called from
    %6 = call @forward(%0, %1, %2, %3, %4, %5) : (!torch.vtensor<[1,4,128,128],f16>, !torch.vtensor<[1,64,2048],f16>, !torch.vtensor<[1,1280],f16>, !torch.vtensor<[1,6],f16>, !torch.vtensor<[1],f16>, !torch.vtensor<[1],si64>) -> !torch.vtensor<[1,4,128,128],f16>
         ^

Dispatch-level Error:

iree.compiler.tools.binaries.CompilerToolError: Error invoking IREE compiler tool iree-compile
Error code: 1
Diagnostics:
failed to translate executables
/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:26:10: error: cannot get concrete layout for contraction
          %11 = arith.addf %out, %10 : f32 loc("/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/compiled_scheduled_unet_run_forward$async_dispatch_50.mlir":26:10)
         ^
/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:9:6: error: 'func.func' op failed to distribute
      func.func @run_forward$async_dispatch_50_matmul_like_64x64x640x320_f16xf16xf32() attributes {translation_info = #iree_codegen.translation_info<LLVMGPUVectorDistribute workgroup_size = [128, 2, 1] subgroup_size = 64, {mma_schedule = #iree_gpu.mma_schedule<intrinsic = #iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, subgroup_m_count = 2, subgroup_n_count = 2>}>} {
     ^
/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:2:2: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>) {
 ^

Reproducer:
attention_and_matmul_spec_mfma.mlir
compiled_scheduled_unet_run_forward$async_dispatch_50.mlir

Invoked with:
 iree-compile --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=rocm --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx942 --iree-opt-const-eval=false --iree-rocm-waves-per-eu=2 --iree-hal-dump-executable-files-to=/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches --iree-flow-enable-aggressive-fusion --iree-global-opt-enable-fuse-horizontal-contractions=true --iree-opt-aggressively-propagate-transposes=true --iree-global-opt-propagate-transposes=true --iree-opt-outer-dim-concat=true --iree-vm-target-truncate-unsupported-floats --iree-llvmgpu-enable-prefetch=true --iree-opt-data-tiling=false --iree-codegen-gpu-native-math-precision=true --iree-rocm-waves-per-eu=2 --iree-codegen-llvmgpu-use-vector-distribution=true --iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics)) --iree-codegen-transform-dialect-library=/home/monorimet/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/attention_and_matmul_spec_mfma.mlir --compile-from=flow compiled_scheduled_unet_run_forward\$async_dispatch_50.mlir

Steps to reproduce your issue

No response

What component(s) does this issue relate to?

Compiler

Version information

commit 2a2a4d0 (HEAD -> main, origin/main, origin/HEAD)

Additional context

This is a new version of the IR in which the batch dim is 1, so we likely have a vector distribution issue for this shape of the op.

FWIW, the same dispatch also fails with vector distribution explicitly disabled:
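For reference, a minimal sketch of how vector distribution can be disabled for this reproducer: flip the `--iree-codegen-llvmgpu-use-vector-distribution` flag in the invocation above from `true` to `false`, keeping all other flags the same (paths shown are the ones from the invocation above and are machine-specific):

```shell
# Same reproducer as above, but with vector distribution turned off.
# This still fails, now during MLIR-LLVM-dialect-to-llvm::Module translation
# (see the error log below).
iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-rocm-target-chip=gfx942 \
  --iree-codegen-llvmgpu-use-vector-distribution=false \
  --compile-from=flow \
  compiled_scheduled_unet_run_forward\$async_dispatch_50.mlir
```

(Other flags from the full invocation above are elided here for brevity; the full flag set should be kept when actually reproducing.)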

<unknown>:0: error: LLVM Translation failed for operation: builtin.unrealized_conversion_cast
/home/eagarvey/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:2:2: error: failed to translate the MLIR LLVM dialect to the native llvm::Module
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>) {
 ^
/home/eagarvey/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:2:2: error: failed to serialize executable for target backend rocm
  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none", waves_per_eu = 2 : i64}>) {
 ^
/home/eagarvey/SHARK/models/stabilityai_sdxl_turbo_1_64_1024x1024_fp16_gfx942/EulerDiscrete_unet_4_dispatches/configured_compiled_scheduled_unet_run_forward$async_dispatch_50.mlir:1:0: error: failed to serialize executables