M1 performance
emmajane1313 opened this issue · 8 comments
Converting the SD 1.5 model from Hugging Face with the script on an M1 Mac (running `python3 scripts/hf2pyke.py runwayml/stable-diffusion-v1-5 ~/pyke-diffusers-sd15/`) and getting this traceback:
File "/Users/devdesign/.asdf/installs/python/3.10.6/lib/python3.10/site-packages/accelerate/big_modeling.py", line 215, in dispatch_model
main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0]
IndexError: list index out of range
Potentially related to huggingface/accelerate#796. I'll patch hf2pyke to allow bypassing accelerate.
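For context, the crash happens because accelerate builds a device map containing only `cpu`/`disk` entries, so the list comprehension in the traceback comes up empty. Below is a minimal sketch of what bypassing accelerate amounts to, using diffusers' standard `low_cpu_mem_usage` flag; it illustrates the workaround and is not necessarily the exact change made to hf2pyke:

```python
# Sketch: load a pipeline component without accelerate's device-map
# dispatch. Illustrates the workaround; not necessarily what hf2pyke
# does internally.
from diffusers import UNet2DConditionModel

# With device_map="auto", accelerate can build a CPU/disk-only map on a
# machine with no accelerator, and dispatch_model then raises the
# IndexError seen above. Plain eager loading sidesteps accelerate entirely:
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder="unet",
    low_cpu_mem_usage=False,  # disable accelerate's meta-device loading path
)
```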
Thanks! Should I re-clone and try?
Yes, please pull a40d4f9 and run `hf2pyke` with `--no-accelerate`:

```
python3 scripts/hf2pyke.py --no-accelerate runwayml/stable-diffusion-v1-5 ~/pyke-diffusers-sd15/
```
Cool, model converted! How long does an image usually take to generate? I've had it running for about 5 minutes; is that normal?
I'm not sure about performance on M1. float32 generation on CPU takes around 3 minutes on my Ryzen 5600X and up to 10 minutes on a Xeon E5-2680 v3. ONNX Runtime is not as well optimized for ARM as it is for x86.
If you're looking for the fastest performance on M1, I'd recommend HuggingFace Diffusers. HuggingFace has been working hard on MPS support, and it's probably your best bet. ONNX Runtime's CoreML backend won't do much for Stable Diffusion, and there's not much else I can do short of writing a custom AI runtime (which, to be fair, I am considering given how many issues ORT has given me...)
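For reference, running on MPS with Diffusers looks roughly like this. This sketch follows HuggingFace's published MPS guidance rather than anything in this repo, and the one-step warmup pass is their recommended workaround for older PyTorch versions:

```python
# Sketch: Stable Diffusion on Apple Silicon via HuggingFace Diffusers'
# MPS backend, per their MPS how-to guide (not part of this repo).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("mps")  # move the whole pipeline onto the Metal backend

# One-step warmup pass; HuggingFace recommends this to work around an
# MPS issue in older PyTorch versions.
_ = pipe("warmup", num_inference_steps=1)

image = pipe("photo of a red fox in a forest").images[0]
image.save("fox.png")
```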
Just got to know this project; I'm also interested in running on M1. I'm curious whether the Rust backend could provide added performance; this implementation also looks more flexible than the current one provided by Apple.
What are the next steps to get started with the Rust backend? Does `requirements.txt` need to be adapted to run optimized on M1?
I got the model converted as well.
```
❯ python3 scripts/hf2pyke.py --no-accelerate runwayml/stable-diffusion-v1-5 ~/pyke-diffusers-sd15/
....
✨ Your model is ready! ~/pyke-diffusers-sd15
```
> I'm curious whether the Rust backend could provide added performance; this implementation also looks more flexible than the current one provided by Apple.
I'd be happy to introduce optimizations for M1, but not much of the computation actually happens in Rust; most of it is done by ONNX Runtime. Any optimizations on my side would only shave off maybe a few seconds.
I would be open to replacing ONNX Runtime though...
> What are the next steps to get started with the Rust backend? Does `requirements.txt` need to be adapted to run optimized on M1?
I'm not sure what you mean by this. `requirements.txt` is only for the `hf2pyke` script; it is not used by the Rust code.
I added support for the CoreML execution provider in 23a7800. I don't have a Mac, so I couldn't do any testing, but from a quick glance the UNet looks to be mostly composed of supported operators. Execution should be significantly faster, hopefully under a minute.
To use the CoreML backend, you need to:
- build ONNX Runtime from source with CoreML support - their docs are lacking, so use their standard build guide and pass the `--use_coreml` flag to `build.sh`
- point `ort` to your binaries - set the `ORT_STRATEGY=system` env variable, and use `ORT_LIB_LOCATION` to point to your binaries when building/running `diffusers`
- enable the `ort-coreml` Cargo feature in `diffusers`
- create your pipeline with `devices: DiffusionDeviceControl::All(DiffusionDevice::CoreML)`
If you experience any issues please let me know.