Amblyopius/Stable-Diffusion-ONNX-FP16

Do the ONNX models produced this way support batch size above 1?

axodox opened this issue · 4 comments

I am working on a C++ and DirectML based Stable Diffusion app. I have experimented with this tool and generated an ONNX model from Realistic Vision 1.4, with FP16 + autoslicing. The model works fine as long as my batch size is 1. If I go above 1, only the first image is good (or even that fails, depending on the input).

I thought the problem was my code, but I have tried with this model and it works fine with the same code.

Another possibility is that the converted models expect the data for multiple batch entries in a different layout. I tried multiple configurations, but with the converted models at most the first image was good; the others did not work.

Some inputs I have tried for batch size 3:

  • sample [A, A, B, B, C, C]
  • encoder_hidden_states [ Uncond, Cond ]
  • output expectation [ AU, AC, BU, BC, CU, CC ]

Converted model: first image good, rest are bad
Reference model: fail to run

  • sample [A, A, B, B, C, C]
  • encoder_hidden_states [ Uncond, Cond, Uncond, Cond, Uncond, Cond ]
  • output expectation [ AU, AC, BU, BC, CU, CC ]

Converted model: first image good, rest are bad
Reference model: all ok

  • sample [A, B, C, A, B, C ]
  • encoder_hidden_states [ Uncond, Uncond, Uncond, Cond, Cond, Cond ]
  • output expectation [ AU, BU, CU, AC, BC, CC ]

Converted model: all fail
Reference model: all ok
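For clarity, the two batch layouts tried above can be written out as a small sketch. This only manipulates labels, not tensors; the function name `build_cfg_batch` and the layout names "interleaved" and "grouped" are made up for illustration and are not part of the conversion tool or of diffusers.

```python
def build_cfg_batch(latents, layout):
    """Return (sample, encoder_hidden_states, expected_output) as label lists.

    latents: list of latent labels, e.g. ["A", "B", "C"].
    layout:  "interleaved" -> each latent is immediately followed by its copy
             "grouped"     -> all unconditional entries first, then all conditional
    (Both layout names are hypothetical, chosen for this sketch.)
    """
    if layout == "interleaved":
        sample = [x for x in latents for _ in range(2)]
        hidden = ["Uncond", "Cond"] * len(latents)
    elif layout == "grouped":
        sample = list(latents) * 2
        hidden = ["Uncond"] * len(latents) + ["Cond"] * len(latents)
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    # Row i of the UNet output should correspond to sample[i] paired with hidden[i].
    expected = [s + h[0] for s, h in zip(sample, hidden)]
    return sample, hidden, expected


if __name__ == "__main__":
    for layout in ("interleaved", "grouped"):
        sample, hidden, expected = build_cfg_batch(["A", "B", "C"], layout)
        print(layout, sample, hidden, expected)
```

In both layouts the batch dimension of `sample` and `encoder_hidden_states` is 6, so a model that handles the batch dimension correctly should produce one valid output row per pair regardless of the ordering.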

Do you have any idea what the issue is?

BTW, I have also noticed that with FP16 without autoslicing I cannot generate 512x512 px images with batch size 2, as I run out of memory quite badly with 12GB VRAM. This might be completely normal, but the reference model, which uses a lot more resources for batch size 1, still runs OK for batch size 2, so its resource use does not increase as much.

Great tool though! I am very glad to have it.

Hi, in theory batch size 2 should work if the only thing you are using is the model (some of the custom pipelines do not support batch size above 1, but the diffusers ones do). I haven't tried it in a while due to the VRAM consumption that comes with it.

I think there are differences from the old 1.4 ONNX models because they came from an older diffusers version (0.6.0).

But it should be possible to test it from Python. I'll see if I can verify.

ssube commented

I don't think the ONNX format/conversion has any impact on the batch size, but the pipelines being used might. Using some of the more intensive FP16 optimizations, I've run batch sizes up to 5 on the CUDA provider.

I have tried the same converted network with https://github.com/azuritecoin/OnnxDiffusersUI, and the same issue is present with batch size 2 and above. So at least now I know that the issue is not my own C++ ONNX pipeline. I will try converting other models.

Ok, I did some investigation. My experience is that any model I convert, be it from HuggingFace, CivitAI, ckpt or safetensors, does not work properly with batch size above 1. I tested both in my app and in the one you suggested in the readme. I only tested with autoslicing and FP16, as the other configurations will not fit into my GPU memory. All converted models work for batch size 1.

Here is how you can reproduce the issue:

Result: you will get noise instead of images. In my code, as I mentioned, I get either the same result or a good first image with the others still being noise.

Note: I have since tried running the ONNX models generated by the tool with batch size 4 where all instances of the inputs were the same (e.g. submitting the same latent noise four times). Even after the first run, I can already see that the output for the first latent image differs from the others (which are identical to each other). If the network worked properly, the same image would be generated 4 times. Here is what I get for your reference:

[Images: test0, test1, test2, test3]

(In the other tool you recommend in your readme, all images are bad. I get the same result in my app if I use exactly the same input layout.)
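The identical-input check described above can be automated with a small sketch like the one below. It assumes each batch element of the UNet output has been flattened to a plain sequence of floats (for example via `ravel().tolist()` on a NumPy array); the helper names and the tolerance `tol=1e-3` are assumptions made for this illustration, not values taken from the tool.

```python
def max_abs_diff_from_first(batch_outputs):
    """For each batch element, return its largest absolute difference from element 0."""
    first = batch_outputs[0]
    diffs = []
    for out in batch_outputs:
        if len(out) != len(first):
            raise ValueError("all batch elements must have the same length")
        diffs.append(max((abs(a - b) for a, b in zip(first, out)), default=0.0))
    return diffs


def batch_is_consistent(batch_outputs, tol=1e-3):
    """True if every batch element matches element 0 within tol (assumed FP16-friendly)."""
    return all(d <= tol for d in max_abs_diff_from_first(batch_outputs))
```

If the same latent and the same text embedding are submitted in every batch slot, `batch_is_consistent` should return True after a single UNet call; a False result after the first step would point at the model rather than at the scheduler or the rest of the pipeline.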