FahimF/sd-gui

Slow it/s on M1 Mac

Opened this issue · 7 comments

Everything works nicely, but the it/s rate is super slow on my M1 Mac (Monterey / MacBook Pro). Did you activate the available MPS tweaks?

[Screenshot 2022-10-03 at 17:30:59]

https://github.com/Birch-san/k-diffusion <= Birch-san offers an MPS-enabled k-diffusion branch for M1. Please integrate this into your great tool! It will speed up generation time tremendously.

I get about 1.62 it/s on my M1 MBP. The MPS support comes from PyTorch itself, so no integration is needed. The usual reason for it being slow is not running the latest PyTorch nightlies. What do you get if you run the following?

pip list | grep torch
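You can also confirm from within Python that MPS is actually available; a quick check (the torch.backends.mps API ships with recent nightlies):

import torch

# Both should print True on an M1 Mac with a recent nightly; if the first
# is False, everything silently runs on the CPU, which would explain the speed.
print(torch.backends.mps.is_available())  # an MPS device is usable right now
print(torch.backends.mps.is_built())      # this PyTorch build includes MPS support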

The installs look OK, don't they?

(base) ➜  ~ conda activate ml
(ml) ➜  ~ pip list | grep torch
torch                         1.13.0.dev20220924
torchaudio                    0.13.0.dev20220922
torchvision                   0.14.0.dev20220923
(ml) ➜  ~ 

Please find my it/s log below (M1 MBP, base model):

Last login: Mon Oct  3 18:48:55 on ttys000
(base) ➜  sd-gui git:(main) ✗ conda activate ml
(ml) ➜  sd-gui git:(main) ✗ python app.py
Type: GeneratorType.txt2img
Scheduler: Default
Prompt: Daddy cool, hulk hogan by bosch
Width: 512
Height: 512
Strength: 0.6
Num Stpes: 50
Guidance: 9.9
Copies: 1
Seed: -1
{'trained_betas'} was not found in config. Values will be initialized to default values.
Seed for new image: 8812203425576809384
 16%|██████▉                                     | 8/51 [01:39<11:10, 15.60s/it]


See anything unusual?

[Screenshot 2022-10-03 at 18:58:51]

Last login: Mon Oct  3 18:56:00 on ttys001
(base) ➜  sd-gui git:(main) ✗ conda activate ml
(ml) ➜  sd-gui git:(main) ✗ cd sd-gui
cd: no such file or directory: sd-gui
(ml) ➜  sd-gui git:(main) ✗ python app.py
Type: GeneratorType.txt2img
Scheduler: Default
Prompt: Daddy cool, hulk hogan by boschtest
Width: 512
Height: 512
Strength: 0.6
Num Stpes: 50
Guidance: 9.9
Copies: 1
Seed: -1
{'trained_betas'} was not found in config. Values will be initialized to default values.
Seed for new image: 13912555252440801915
 10%|████▎                                       | 5/51 [00:52<08:46, 11.45s/it]

I suppose it's some MPS problem (as with all M1 solutions, haha).

>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
>> Setting Sampler to k_euler_a
/Users/bamboozle/stable-diffusion/InvokeAI/ldm/modules/embedding_manager.py:153: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at  /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1659484612588/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  placeholder_idx = torch.where(
100%|███████████████████████████████████████████| 20/20 [01:35<00:00,  4.78s/it]
Generating: 100%|████████████████████████████████| 1/1 [01:43<00:00, 103.57s/it]
>> Usage stats:
>>   1 image(s) generated in 103.77s
127.0.0.1 - - [03/Oct/2022 19:09:11] "GET /outputs/img-samples/000001.807159257.png HTTP/1.1" 200 -

Here you can see the speed with MPS support on the same machine, using https://github.com/invoke-ai/InvokeAI.

You do have the right PyTorch version. Anything after 20220924 and up to 20220930 is slow; I have not tested anything more recent. From what I can tell from the runs you posted, they would take somewhere between 200 s and 300 s for a 20-step run, whereas the InvokeAI run took 103 s for 20 steps. Is that correct?
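Back of the envelope: your first run at 15.60 s/it extrapolates to 15.60 × 20 ≈ 312 s for 20 steps, and the second at 11.45 s/it to 11.45 × 20 ≈ 229 s, versus 103 s for the InvokeAI run.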

If so, my guess would be that Hugging Face diffusers is slower than the other SD implementation you're using. I can only verify by switching to the other one, but you've given me something to really think about: if there's such a drastic difference in speed, I should seriously consider switching. I'd need to do some testing first to make sure there is indeed a speed difference. I'll get back to you on this later today...
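For anyone who wants to reproduce the timing, here is a minimal benchmark sketch using the diffusers pipeline on MPS. It assumes the model weights are already cached locally; the model ID, prompt, and step count are illustrative, not necessarily what sd-gui uses internally:

import time
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("mps")

# One-step warm-up: the first pass on MPS includes one-time setup cost,
# so time a separate run for a fair comparison.
pipe("warm-up", num_inference_steps=1)

start = time.time()
pipe("Daddy cool, hulk hogan by bosch", num_inference_steps=20)
print(f"20 steps took {time.time() - start:.2f}s")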

You are right: my GUI generates images more slowly, with everything the same except for the Stable Diffusion engine (same environment, same Python packages, etc.). Here's a comparison using the same prompt, scheduler, number of steps, and guidance values...

Here's InvokeAI:
[screenshot: InvokeAI]

Here's mine:
[screenshot: sd-gui]

Given that the only difference is the SD engine, it does look as if the diffusers-based approach is slower, at least at present. So I might be switching over to the other approach sooner rather than later. Guess I know what I'm doing next weekend 😛

Thanks for your hard work, bro!