M1 performance
emmajane1313 opened this issue · 8 comments
Converting the SD 1.5 model from Hugging Face with the script on an M1 Mac (running `python3 scripts/hf2pyke.py runwayml/stable-diffusion-v1-5 ~/pyke-diffusers-sd15/`) and getting this traceback:
File "/Users/devdesign/.asdf/installs/python/3.10.6/lib/python3.10/site-packages/accelerate/big_modeling.py", line 215, in dispatch_model
main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0]
IndexError: list index out of range
Potentially related to huggingface/accelerate#796. I'll patch hf2pyke to allow bypassing accelerate.
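For context, the crash happens because accelerate builds a device map containing only `cpu`/`disk` entries, so the list comprehension in the traceback comes up empty. Below is a minimal sketch of what bypassing accelerate amounts to, using diffusers' standard `low_cpu_mem_usage` flag; it illustrates the workaround and is not necessarily the exact change made to hf2pyke:

```python
# Sketch: load a pipeline component without accelerate's device-map
# dispatch. Illustrates the workaround; not necessarily what hf2pyke
# does internally.
from diffusers import UNet2DConditionModel

# With device_map="auto", accelerate can build a CPU/disk-only map on a
# machine with no accelerator, and dispatch_model then raises the
# IndexError seen above. Plain eager loading sidesteps accelerate entirely:
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder="unet",
    low_cpu_mem_usage=False,  # disable accelerate's meta-device loading path
)
```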
Thanks! Should I re-clone and try?
Yes, please pull a40d4f9 and run `hf2pyke` with `--no-accelerate`:

```
python3 scripts/hf2pyke.py --no-accelerate runwayml/stable-diffusion-v1-5 ~/pyke-diffusers-sd15/
```
Cool, model converted! How long does an image usually take to generate? I've had it running for about 5 minutes; is that normal?
I'm not sure about performance on M1. float32 generation on CPU takes around 3 minutes on my Ryzen 5600X and up to 10 minutes on a Xeon E5-2680 v3. ONNX Runtime is not as well optimized for ARM as it is for x86.
If you're looking for the fastest performance on M1, I'd recommend HuggingFace Diffusers. HuggingFace has been working hard on MPS support, and it's probably your best bet. ONNX Runtime's CoreML backend won't do much for Stable Diffusion, and there's not much else I can do short of writing a custom AI runtime (which, to be fair, I am considering given how many issues ORT has given me...)
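For reference, running on MPS with Diffusers looks roughly like this. This sketch follows HuggingFace's published MPS guidance rather than anything in this repo, and the one-step warmup pass is their recommended workaround for older PyTorch versions:

```python
# Sketch: Stable Diffusion on Apple Silicon via HuggingFace Diffusers'
# MPS backend, per their MPS how-to guide (not part of this repo).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("mps")  # move the whole pipeline onto the Metal backend

# One-step warmup pass; HuggingFace recommends this to work around an
# MPS issue in older PyTorch versions.
_ = pipe("warmup", num_inference_steps=1)

image = pipe("photo of a red fox in a forest").images[0]
image.save("fox.png")
```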
Just got to know this project; I'm also interested in running on M1. I'm curious whether the Rust backend could provide added performance; this implementation also looks more flexible than the current one provided by Apple.
What are the next steps to get started with the Rust backend? Does `requirements.txt` need to be adapted to run optimized on M1?
I got the model converted as well.
```
❯ python3 scripts/hf2pyke.py --no-accelerate runwayml/stable-diffusion-v1-5 ~/pyke-diffusers-sd15/
....
✨ Your model is ready! ~/pyke-diffusers-sd15
```
> I'm curious whether the Rust backend could provide added performance; this implementation also looks more flexible than the current one provided by Apple.
I'd be happy to introduce optimizations for M1, but not much of the computation actually happens in Rust; most of it is done by ONNX Runtime. Any optimizations on my side would only shave off maybe a few seconds.
I would be open to replacing ONNX Runtime though...
> What are the next steps to get started with the Rust backend? Does `requirements.txt` need to be adapted to run optimized on M1?
I'm not sure what you mean by this. `requirements.txt` is only for the `hf2pyke` script; it is not used by the Rust code.
I added support for the CoreML execution provider in 23a7800. I don't have a Mac, so I couldn't do any testing, but from a quick glance the UNet looks to be mostly composed of supported operators. Execution should be significantly faster, hopefully under a minute.
To use the CoreML backend, you need to:
- build ONNX Runtime from source with CoreML support - their docs are lacking, so use their standard build guide and pass the `--use_coreml` flag to `build.sh`
- point `ort` to your binaries - set the `ORT_STRATEGY=system` env variable, and use `ORT_LIB_LOCATION` to point to your binaries when building/running `diffusers`
- enable the `ort-coreml` Cargo feature in `diffusers`
- create your pipeline with `devices: DiffusionDeviceControl::All(DiffusionDevice::CoreML)`
If you experience any issues please let me know.