SDTurbo - 'MultiHeadAttention_0' Failed to run JSEP kernel
Hi!
Thank you for this great work.
I'm trying to run SDTurbo with diffusers.js.
I've followed the instructions from this issue to export the model to ONNX, making this change in the conversion script:
```python
# optimization_options.enable_qordered_matmul = False
optimization_options.enable_packed_qkv = False  # not supported on webgpu
optimization_options.enable_packed_kv = False  # not supported on webgpu
```
And ran the conversion with:

```bash
python Stable-Diffusion-ONNX-FP16/conv_sd_to_onnx.py \
  --model_path "stabilityai/sd-turbo" \
  --output_path "./model/sdturbo-fp16" \
  --fp16
```
Full log of the export:

```
2024-01-11 22:52:11.126633: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-11 22:52:11.126680: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-11 22:52:11.128271: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-11 22:52:13.292449: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading pipeline components...: 100% 5/5 [00:42<00:00, 8.53s/it]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/usr/local/lib/python3.10/dist-packages/torch/onnx/symbolic_opset9.py:5856: UserWarning: Exporting aten::index operator of advanced indexing in opset 17 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py:915: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/usr/local/lib/python3.10/dist-packages/diffusers/models/downsampling.py:135: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/downsampling.py:144: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/upsampling.py:149: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/upsampling.py:165: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py:1206: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not return_dict:
/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/autoencoder_kl.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not return_dict:
/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:179.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:179.)
_C._jit_pass_onnx_graph_shape_type_inference(
/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:179.)
_C._jit_pass_onnx_graph_shape_type_inference(
/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/autoencoder_kl.py:306: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not return_dict:
2024-01-11 23:02:48.140679604 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:02:48.143225771 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:02:48.143247983 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-01-11 23:02:58.092696522 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:02:58.092735644 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
ONNX pipeline saved to model/sdturbo-fp16
Loading pipeline components...: 0% 0/6 [00:00<?, ?it/s]2024-01-11 23:03:10.174160414 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:03:10.178587318 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:10.178615811 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 33% 2/6 [00:00<00:01, 2.19it/s]2024-01-11 23:03:11.979303480 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:03:11.983210207 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:11.983247143 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 67% 4/6 [00:02<00:01, 1.85it/s]2024-01-11 23:03:16.868251774 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:03:16.881676989 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:16.881703685 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 83% 5/6 [00:07<00:02, 2.13s/it]2024-01-11 23:03:17.788958820 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:17.788983933 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 100% 6/6 [00:16<00:00, 2.79s/it]
ONNX pipeline is loadable
```
Everything seems to export and load properly in the browser with WebGPU, and I'm also able to run the text-encoder & vae-decoder of the exported model with WebGPU without issue.

However, when I try to run a step of the unet, I get this error:
```
ort.webgpu.min.js:10 Uncaught (in promise) Error: failed to call OrtRun(). ERROR_CODE: 1, ERROR_MESSAGE: Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: Failed to run JSEP kernel
    at t.checkLastError (ort.webgpu.min.js:10:491501)
    at t.run (ort.webgpu.min.js:10:486314)
    at async t.OnnxruntimeWebAssemblySessionHandler.run (ort.webgpu.min.js:10:477016)
    at async a.run (ort.webgpu.min.js:10:1152723)
    ...
```
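For context, the failing call boils down to something like the following. This is only a minimal sketch with onnxruntime-web's WebGPU backend: the input/output names (`sample`, `timestep`, `encoder_hidden_states`, `out_sample`) follow the diffusers ONNX export, but the model path, dtypes, and dims are assumptions for a 512x512 run at batch size 1.

```ts
import * as ort from 'onnxruntime-web/webgpu'

// Placeholder inputs; in the real pipeline these come from the scheduler
// and the text encoder.
const latents = new Float32Array(1 * 4 * 64 * 64)
const promptEmbeds = new Float32Array(1 * 77 * 1024)

const session = await ort.InferenceSession.create('unet/model.onnx', {
  executionProviders: ['webgpu'],
})

const feeds = {
  sample: new ort.Tensor('float32', latents, [1, 4, 64, 64]),
  timestep: new ort.Tensor('int64', BigInt64Array.from([999n]), [1]),
  encoder_hidden_states: new ort.Tensor('float32', promptEmbeds, [1, 77, 1024]),
}

// This is the call that fails inside the MultiHeadAttention node:
const results = await session.run(feeds)
const noisePred = results.out_sample
```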
It's not clear why this operator fails, as it seems supported and runs fine in SD 2.1. Is it a known issue? Any pointers would be welcome!
Hmm, it turns out I get the same issue with aislamov/lcm-dreamshaper-v7-onnx on my setup, whereas it works fine in the example.

Some random fact: if I concatenate the latents to get a shape of `[2, 4, 64, 64]`, I don't get the error, but the unet returns NaNs.

EDIT: This was an unrelated error, but it was still outputting the same error message.
It looks like there's at least one issue with the MultiHeadAttention kernel. There's a temp fix for that which runs the unet twice, once for the prompt and once for the negative prompt: https://github.com/gfodor/diffusers.js/tree/optimum-sdxl
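In other words, the workaround looks roughly like this (a sketch only; `runUnet` and the other names are hypothetical stand-ins for the pipeline's internals):

```ts
// Instead of batching [uncond, cond] into a single UNet call (which trips
// the MultiHeadAttention JSEP kernel), run the UNet once per embedding and
// recombine the two noise predictions with classifier-free guidance.
const noiseCond = await runUnet(latents, t, promptEmbeds)
const noiseUncond = await runUnet(latents, t, negativePromptEmbeds)
const noisePred = noiseUncond.add(noiseCond.sub(noiseUncond).mul(guidanceScale))
```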
If you can upload your converted model to huggingface, I think I'll find the solution quicker.
Thank you for the quick feedback! I'll give it a try.
Here's the exported model: https://huggingface.co/cyrildiagne/sdturbo-onnx
Hmm, I cleared my indexedDB and the problem is gone. So I guess I was using an early, broken export.
Thanks for the help!
@cyrildiagne Hello, I saw that you were able to implement SD Turbo. Nice work!
I was trying to do the same thing, but I got stuck on a similar error when running the UNET model, which was this:

I was able to avoid this error by switching the backend to WASM, but when I did that the UNET gave me NaNs as output and the final image was a black image.
I used the following command to convert the model to ONNX:
```bash
python conv_sd_to_onnx.py --model_path "stabilityai/sd-turbo" --output_path "./converted_models/sd-turbo-onnx" --attention-slicing "auto" --ckpt-upcast-attention --fp16
```
My converted model can be found here.

Other relevant information:

- I used the PNDM Scheduler as well as an implementation of the EulerDiscrete Scheduler.
- Only the prompt is used as input, like so: `const promptEmbeds = await this.encodePrompt(input.prompt)`. The negative prompt is ignored.
- The Guidance Scale is set to 0.
I'm not sure if my issue is with the converted model or something else, but I wanted to ask if you could help me with this. Any assistance would be appreciated. Thank you!
Hi @jdp8, do you also have the problem when using https://huggingface.co/cyrildiagne/sdturbo-onnx ?
I used your model, and when I run it with 2 steps or fewer I get this image (still noisy):
But when I run 3 steps or more I get a clearer image:
However, if I use the prompt and an empty negative prompt in the embeddings, like so: `const promptEmbeds = await this.getPromptEmbeds(input.prompt, '')`, or if I concatenate the latents, I get the MultiHeadAttention error:
Yes, this model doesn't do CFG, so it expects no negative prompt; your `const promptEmbeds = await this.encodePrompt(input.prompt)` is correct.
I assume that the issue with steps <= 2 comes from using the PNDM scheduler? I plan to do a PR that adds SDTurbo support along with my EulerDiscreteScheduler implementation, but I have to re-adapt it a bit to this codebase first since I diverged from it during experimentation. In the meantime, here's the step function in case it's helpful:
```ts
step(
  modelOutput: Tensor,
  timestep: number,
  sample: Tensor,
  s_churn: number = 0.0,
  s_tmin: number = 0.0,
  s_tmax: number = Infinity,
  s_noise: number = 1.0
) {
  if (this.numInferenceSteps === null) {
    throw new Error(
      "Number of inference steps is 'null', you need to run 'setTimesteps' after creating the scheduler"
    )
  }

  const sigma = this.sigmas.data[this.stepIndex]

  // Compute gamma (the amount of extra noise to churn in), as in
  // diffusers' EulerDiscreteScheduler
  let gamma = 0.0
  if (s_tmin <= sigma && sigma <= s_tmax) {
    gamma = Math.min(
      s_churn / (this.sigmas.data.length - 1),
      Math.sqrt(2) - 1
    )
  }

  const noise = randomNormalTensor(modelOutput.dims)
  const eps = noise.mul(s_noise)
  const sigma_hat = sigma * (gamma + 1)

  if (gamma > 0) {
    // sigma_hat >= sigma here, so the sqrt argument is non-negative
    sample = sample.add(eps.mul(Math.sqrt(sigma_hat ** 2 - sigma ** 2)))
  }

  // 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
  // (config.prediction_type == "epsilon")
  const denoised = sample.sub(modelOutput.mul(sigma_hat))

  // 2. convert to an ODE derivative
  const derivative = sample.sub(denoised).div(sigma_hat)
  const dt = this.sigmas.data[this.stepIndex + 1] - sigma_hat
  const prevSample = sample.add(derivative.mul(dt))

  this.stepIndex++
  return prevSample
}
```
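And here's roughly how it's driven from the pipeline. This is only a sketch: `runUnet`, `runVaeDecoder`, `scaleModelInput`, and `initNoiseSigma` are stand-ins, with the last two mirroring diffusers' `scale_model_input` / `init_noise_sigma` API.

```ts
// Hypothetical denoising loop for sd-turbo (no CFG, guidance scale 0).
scheduler.setTimesteps(numInferenceSteps)
// Euler schedulers expect the initial noise scaled by init_noise_sigma
let latents = randomNormalTensor([1, 4, 64, 64]).mul(scheduler.initNoiseSigma)
for (const t of scheduler.timesteps) {
  // ...and the per-step model input scaled by 1 / sqrt(sigma^2 + 1)
  const input = scheduler.scaleModelInput(latents, t)
  const noisePred = await runUnet(input, t, promptEmbeds)
  latents = scheduler.step(noisePred, t, latents)
}
const image = await runVaeDecoder(latents.div(0.18215)) // SD VAE scaling factor
```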
Yes, you are correct, I was using the PNDM Scheduler instead of my implementation of the EulerDiscreteScheduler. But there must be something wrong with my implementation of the scheduler, because I get a black image at the end. More than likely the issue is in my `setTimesteps()` function.
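For reference, here's roughly what I understand `setTimesteps` needs to produce for sd-turbo, mirroring diffusers' EulerDiscreteScheduler with `timestep_spacing: "trailing"` (the field names below are placeholders from my own class):

```ts
setTimesteps(numInferenceSteps: number) {
  this.numInferenceSteps = numInferenceSteps
  const stepRatio = this.numTrainTimesteps / numInferenceSteps
  // "trailing" spacing: [999] for 1 step, [999, 499] for 2 steps, ...
  this.timesteps = Array.from({ length: numInferenceSteps }, (_, i) =>
    Math.round(this.numTrainTimesteps - 1 - i * stepRatio)
  )
  // sigma_t = sqrt((1 - alphaCumprod_t) / alphaCumprod_t), with a trailing 0
  // so the final step lands on the denoised sample
  const sigmas = this.timesteps.map((t) => {
    const a = this.alphasCumprod[t]
    return Math.sqrt((1 - a) / a)
  })
  sigmas.push(0)
  this.sigmas = sigmas // plain array here; adapt if sigmas is a Tensor
  this.stepIndex = 0
}
```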
I'll wait for your PR so as not to take more time from you. Thank you for your assistance, very much appreciated!
@jdp8 I've started a draft PR here. It's still WIP and needs cleaning, but the relevant code should be there (it works in the react example app).
Awesome, thank you so much! I'll test it out and look into it to see if I can help with anything else.