SDTurbo - 'MultiHeadAttention_0' Failed to run JSEP kernel
Hi!
Thank you for this great work.
I'm trying to run SDTurbo with diffusers.js.
I've followed the instructions from this issue to export the model to ONNX, making this change in the conversion script:
```python
# optimization_options.enable_qordered_matmul = False
optimization_options.enable_packed_qkv = False  # not supported on webgpu
optimization_options.enable_packed_kv = False  # not supported on webgpu
```
And ran the conversion with:

```bash
python Stable-Diffusion-ONNX-FP16/conv_sd_to_onnx.py \
  --model_path "stabilityai/sd-turbo" \
  --output_path "./model/sdturbo-fp16" \
  --fp16
```
Full log of the export:

```
2024-01-11 22:52:11.126633: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-11 22:52:11.126680: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-11 22:52:11.128271: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-11 22:52:13.292449: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading pipeline components...: 100% 5/5 [00:42<00:00, 8.53s/it]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:66: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[-1] > 1 or self.sliding_window is not None:
/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py:137: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if past_key_values_length > 0:
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:281: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py:313: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/usr/local/lib/python3.10/dist-packages/torch/onnx/symbolic_opset9.py:5856: UserWarning: Exporting aten::index operator of advanced indexing in opset 17 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py:915: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if dim % default_overall_up_factor != 0:
/usr/local/lib/python3.10/dist-packages/diffusers/models/downsampling.py:135: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/downsampling.py:144: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/upsampling.py:149: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.10/dist-packages/diffusers/models/upsampling.py:165: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.shape[0] >= 64:
/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py:1206: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not return_dict:
/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/autoencoder_kl.py:265: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not return_dict:
/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py:307: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:179.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py:702: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:179.)
_C._jit_pass_onnx_graph_shape_type_inference(
/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py:1209: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:179.)
_C._jit_pass_onnx_graph_shape_type_inference(
/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/autoencoder_kl.py:306: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if not return_dict:
2024-01-11 23:02:48.140679604 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:02:48.143225771 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:02:48.143247983 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-01-11 23:02:58.092696522 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:02:58.092735644 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
ONNX pipeline saved to model/sdturbo-fp16
Loading pipeline components...: 0% 0/6 [00:00<?, ?it/s]2024-01-11 23:03:10.174160414 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:03:10.178587318 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:10.178615811 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 33% 2/6 [00:00<00:01, 2.19it/s]2024-01-11 23:03:11.979303480 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:03:11.983210207 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:11.983247143 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 67% 4/6 [00:02<00:01, 1.85it/s]2024-01-11 23:03:16.868251774 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-01-11 23:03:16.881676989 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:16.881703685 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 83% 5/6 [00:07<00:02, 2.13s/it]2024-01-11 23:03:17.788958820 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-11 23:03:17.788983933 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Loading pipeline components...: 100% 6/6 [00:16<00:00, 2.79s/it]
ONNX pipeline is loadable
```
Everything seems to export and load properly in the browser with WebGPU, and I'm also able to run the text-encoder & vae-decoder of the exported model with WebGPU without issue.

However, when I try to run a step of the unet, I get this error:
```
ort.webgpu.min.js:10 Uncaught (in promise) Error: failed to call OrtRun(). ERROR_CODE: 1, ERROR_MESSAGE: Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: Failed to run JSEP kernel
    at t.checkLastError (ort.webgpu.min.js:10:491501)
    at t.run (ort.webgpu.min.js:10:486314)
    at async t.OnnxruntimeWebAssemblySessionHandler.run (ort.webgpu.min.js:10:477016)
    at async a.run (ort.webgpu.min.js:10:1152723)
    ...
```
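For context, the failing call boils down to something like the following. This is only a minimal sketch with onnxruntime-web's WebGPU backend: the input/output names (`sample`, `timestep`, `encoder_hidden_states`, `out_sample`) follow the diffusers ONNX export, but the model path, dtypes, and dims are assumptions for a 512x512 run at batch size 1.

```ts
import * as ort from 'onnxruntime-web/webgpu'

// Placeholder inputs; in the real pipeline these come from the scheduler
// and the text encoder.
const latents = new Float32Array(1 * 4 * 64 * 64)
const promptEmbeds = new Float32Array(1 * 77 * 1024)

const session = await ort.InferenceSession.create('unet/model.onnx', {
  executionProviders: ['webgpu'],
})

const feeds = {
  sample: new ort.Tensor('float32', latents, [1, 4, 64, 64]),
  timestep: new ort.Tensor('int64', BigInt64Array.from([999n]), [1]),
  encoder_hidden_states: new ort.Tensor('float32', promptEmbeds, [1, 77, 1024]),
}

// This is the call that fails inside the MultiHeadAttention node:
const results = await session.run(feeds)
const noisePred = results.out_sample
```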
It's not clear why this operator fails, as it seems supported and runs fine in SD 2.1. Is it a known issue? Any pointers would be welcome!
Hmm, it turns out I get the same issue with aislamov/lcm-dreamshaper-v7-onnx on my setup, whereas it works fine in the example.

Some random fact: if I concatenate the latents to get a shape of `[2, 4, 64, 64]`, I don't get the error, but the unet returns NaNs.

EDIT: This was an unrelated error, but it was still outputting the same error message.
It looks like there's at least one issue with the MultiHeadAttention kernel. There's a temp fix for that which runs the unet twice, once for the prompt and once for the negative prompt: https://github.com/gfodor/diffusers.js/tree/optimum-sdxl
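In other words, the workaround looks roughly like this (a sketch only; `runUnet` and the other names are hypothetical stand-ins for the pipeline's internals):

```ts
// Instead of batching [uncond, cond] into a single UNet call (which trips
// the MultiHeadAttention JSEP kernel), run the UNet once per embedding and
// recombine the two noise predictions with classifier-free guidance.
const noiseCond = await runUnet(latents, t, promptEmbeds)
const noiseUncond = await runUnet(latents, t, negativePromptEmbeds)
const noisePred = noiseUncond.add(noiseCond.sub(noiseUncond).mul(guidanceScale))
```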
If you can upload your converted model to huggingface, I think I'll find the solution quicker.
Thank you for the quick feedback! I'll give it a try.
Here's the exported model: https://huggingface.co/cyrildiagne/sdturbo-onnx
Hmm, I cleared my indexedDB and the problem is gone. So I guess I was using an early, broken export.
Thanks for the help!
@cyrildiagne Hello, I saw that you were able to implement SD Turbo. Nice work!
I was trying to do the same thing, but I got stuck on a similar error when running the UNET model, which was this:

I was able to avoid this error by switching the backend to WASM, but when I did that the UNET gave me NaNs as output and the final image was a black image.
I used the following command to convert the model to ONNX:
```bash
python conv_sd_to_onnx.py --model_path "stabilityai/sd-turbo" --output_path "./converted_models/sd-turbo-onnx" --attention-slicing "auto" --ckpt-upcast-attention --fp16
```
My converted model can be found here.

Other relevant information:

- I used the PNDM Scheduler as well as an implementation of the EulerDiscrete Scheduler.
- Only the prompt is used as input, like so: `const promptEmbeds = await this.encodePrompt(input.prompt)`. The negative prompt is ignored.
- The Guidance Scale is set to 0.
I'm not sure if my issue is with the converted model or something else, but I wanted to ask if you could help me with this. Any assistance would be appreciated. Thank you!
Hi @jdp8, do you also have the problem when using https://huggingface.co/cyrildiagne/sdturbo-onnx ?
I used your model, and when I run it with 2 steps or fewer I get this image (still noisy):
But when I run 3 steps or more I get a clearer image:
However, if I use the prompt and an empty negative prompt in the embeddings, like so: `const promptEmbeds = await this.getPromptEmbeds(input.prompt, '')`, or if I concatenate the latents, I get the MultiHeadAttention error:
Yes, this model doesn't do CFG, so it expects no negative prompt; your `const promptEmbeds = await this.encodePrompt(input.prompt)` is correct.
I assume that the issue with steps <= 2 comes from using the PNDM scheduler? I plan to do a PR that adds SDTurbo support along with my EulerDiscreteScheduler implementation, but I have to re-adapt it a bit to this codebase first since I diverged from it during experimentation. In the meantime, here's the step function in case it's helpful:
```ts
step(
  modelOutput: Tensor,
  timestep: number,
  sample: Tensor,
  s_churn: number = 0.0,
  s_tmin: number = 0.0,
  s_tmax: number = Infinity,
  s_noise: number = 1.0
) {
  if (this.numInferenceSteps === null) {
    throw new Error(
      "Number of inference steps is 'null', you need to run 'setTimesteps' after creating the scheduler"
    )
  }

  const sigma = this.sigmas.data[this.stepIndex]

  // Compute gamma (the amount of extra noise to churn in), as in
  // diffusers' EulerDiscreteScheduler
  let gamma = 0.0
  if (s_tmin <= sigma && sigma <= s_tmax) {
    gamma = Math.min(
      s_churn / (this.sigmas.data.length - 1),
      Math.sqrt(2) - 1
    )
  }

  const noise = randomNormalTensor(modelOutput.dims)
  const eps = noise.mul(s_noise)
  const sigma_hat = sigma * (gamma + 1)

  if (gamma > 0) {
    // sigma_hat >= sigma here, so the sqrt argument is non-negative
    sample = sample.add(eps.mul(Math.sqrt(sigma_hat ** 2 - sigma ** 2)))
  }

  // 1. compute predicted original sample (x_0) from sigma-scaled predicted noise
  // (config.prediction_type == "epsilon")
  const denoised = sample.sub(modelOutput.mul(sigma_hat))

  // 2. convert to an ODE derivative
  const derivative = sample.sub(denoised).div(sigma_hat)
  const dt = this.sigmas.data[this.stepIndex + 1] - sigma_hat
  const prevSample = sample.add(derivative.mul(dt))

  this.stepIndex++
  return prevSample
}
```
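And here's roughly how it's driven from the pipeline. This is only a sketch: `runUnet`, `runVaeDecoder`, `scaleModelInput`, and `initNoiseSigma` are stand-ins, with the last two mirroring diffusers' `scale_model_input` / `init_noise_sigma` API.

```ts
// Hypothetical denoising loop for sd-turbo (no CFG, guidance scale 0).
scheduler.setTimesteps(numInferenceSteps)
// Euler schedulers expect the initial noise scaled by init_noise_sigma
let latents = randomNormalTensor([1, 4, 64, 64]).mul(scheduler.initNoiseSigma)
for (const t of scheduler.timesteps) {
  // ...and the per-step model input scaled by 1 / sqrt(sigma^2 + 1)
  const input = scheduler.scaleModelInput(latents, t)
  const noisePred = await runUnet(input, t, promptEmbeds)
  latents = scheduler.step(noisePred, t, latents)
}
const image = await runVaeDecoder(latents.div(0.18215)) // SD VAE scaling factor
```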
Yes, you are correct, I was using the PNDM Scheduler instead of my implementation of the EulerDiscreteScheduler. But there must be something wrong with my implementation of the scheduler, because I get a black image at the end. More than likely the issue is in my `setTimesteps()` function.
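For reference, here's roughly what I understand `setTimesteps` needs to produce for sd-turbo, mirroring diffusers' EulerDiscreteScheduler with `timestep_spacing: "trailing"` (the field names below are placeholders from my own class):

```ts
setTimesteps(numInferenceSteps: number) {
  this.numInferenceSteps = numInferenceSteps
  const stepRatio = this.numTrainTimesteps / numInferenceSteps
  // "trailing" spacing: [999] for 1 step, [999, 499] for 2 steps, ...
  this.timesteps = Array.from({ length: numInferenceSteps }, (_, i) =>
    Math.round(this.numTrainTimesteps - 1 - i * stepRatio)
  )
  // sigma_t = sqrt((1 - alphaCumprod_t) / alphaCumprod_t), with a trailing 0
  // so the final step lands on the denoised sample
  const sigmas = this.timesteps.map((t) => {
    const a = this.alphasCumprod[t]
    return Math.sqrt((1 - a) / a)
  })
  sigmas.push(0)
  this.sigmas = sigmas // plain array here; adapt if sigmas is a Tensor
  this.stepIndex = 0
}
```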
I'll wait for your PR so as not to take more time from you. Thank you for your assistance, very much appreciated!
@jdp8 I've started a draft PR here. It's still WIP and needs cleaning, but the relevant code should be there (it works in the react example app).
Awesome, thank you so much! I'll test it out and look into it to see if I can help with anything else.