CoreML Model Loading and Caching ("First Time" Init) every time?
When building with -F coreml and running the audio_transcription example, you see the following messages:
whisper_init_state: kv self size = 42.00 MB
whisper_init_state: kv cross size = 140.62 MB
whisper_init_state: loading Core ML model from '/Users/g/.ggml-models/ggml-medium-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
Then one needs to wait an agonizing amount of time (3.5 hours for the medium model on my M1 Mac Mini) for the first-time load.
After this, execution is noticeably faster compared to CPU-only execution.
However, the catch is that this first-time load seems to be per-process (which I did not expect) as well as per-model (which I did expect).
So if I make some changes to the code and recompile, the "first-time run" will again take a while.
To me this seems to be a consequence of not passing some kind of cache hint to Apple's Xcode CoreML ANE compiler service, or whatever it is called. Surely there must be some way to make it remember this cached model and avoid having to recompute it?
Looking at Apple's official documentation at https://github.com/apple/ml-stable-diffusion#faq, I see the following:
Q5: Every time I generate an image using the Python pipeline, loading all the Core ML models takes 2-3 minutes. Is this expected?
A5: Yes and using the Swift library reduces this to just a few seconds. The reason is that coremltools loads Core ML models (.mlpackage) and each model is compiled to be run on the requested compute unit during load time. Because of the size and number of operations of the unet model, it takes around 2-3 minutes to compile it for Neural Engine execution. Other models should take at most a few seconds. Note that coremltools does not cache the compiled model for later loads so each load takes equally long. In order to benefit from compilation caching, StableDiffusion Swift package by default relies on compiled Core ML models (.mlmodelc) which will be compiled down for the requested compute unit upon first load but then the cache will be reused on subsequent loads until it is purged due to lack of use.
If you intend to use the Python pipeline in an application, we recommend initializing the pipeline once so that the load time is only incurred once. Afterwards, generating images using different prompts and random seeds will not incur the load time for the current session of your application.
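For what it's worth, the same "initialize once" advice can be applied on the Rust side by keeping a single long-lived context per process and reusing it for every transcription, so the CoreML load cost is only paid at startup of that session. Below is a minimal sketch assuming the whisper-rs API (WhisperContext::new_with_params, create_state, full, and the segment getters) plus a made-up load_audio helper; exact names and signatures may differ between whisper-rs versions, and this of course does not help across recompiles of the binary.

```rust
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

fn main() {
    // Load the ggml model once per process. With the coreml feature enabled,
    // whisper.cpp also picks up the matching ggml-medium-encoder.mlmodelc next
    // to the .bin file; this load is the step that can trigger the slow
    // first-run compile/specialize pass.
    let ctx = WhisperContext::new_with_params(
        "/Users/g/.ggml-models/ggml-medium.bin",
        WhisperContextParameters::default(),
    )
    .expect("failed to load model");

    // Reuse the same context for every file; only the cheap per-run state is recreated.
    for path in ["first.wav", "second.wav"] {
        let audio = load_audio(path);

        let mut state = ctx.create_state().expect("failed to create state");
        let params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
        state.full(params, &audio).expect("transcription failed");

        let n_segments = state.full_n_segments().expect("failed to get segment count");
        for i in 0..n_segments {
            println!("{}", state.full_get_segment_text(i).expect("failed to get segment text"));
        }
    }
}

// Hypothetical helper: decode a file into 16 kHz mono f32 samples (not part of whisper-rs).
fn load_audio(_path: &str) -> Vec<f32> {
    unimplemented!()
}
```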
"and using the Swift library reduces this to just a few seconds", it seems absurd for the CoreML pipeline caching feature to be dependent on what programming language you call this from. Can I not use this within Rust then?
I don't have any experience with macOS, so I'm not sure about any of this. Hopefully someone else with more experience can shine a light on it.
I believe this is a processor-dependent implementation issue on Apple's side, not a programming-language API issue. If you look at the execution output in your original post, it shows it is loading ggml-medium-encoder.mlmodelc, which is the compiled Core ML model.
On my Mac Mini M2 Pro, running the medium model behaves as expected - it takes a minute or so the first time, then just a few seconds afterwards. Future runs usually load quickly even if I make changes to the calling process.
On the other hand, on my MacBook Air M1, the small model works as expected, but the medium model usually hangs indefinitely after the "first run on a device may take a while ..." message. I usually don't wait long enough to find out if it will eventually load.
In ggerganov/whisper.cpp#773, a user notes that force-quitting ANECompilerService will allow the load to continue and execute. This works for me, but it doesn't really solve the problem: the next time I run the binary, I get the hang during load again.
My understanding is that on initial load, ANECompilerService will load the compiled CoreML model, specialize it for the specific processor being used, and cache that for future use. I am guessing that for each Apple Silicon processor there is a maximum model size above which this just doesn't work reliably. I don't know if it is something inherent to the compilation process or just a bug, but either way it is up to Apple to fix it or document the limits.
Thanks for sharing your thoughts in such a detailed manner! It's the first time I've come across a reasonable explanation for this.
Quite frustrating that we have to poke around in the dark and infer things due to a complete lack of documentation on Apple's side.
Could you also clarify how much memory your devices have?
It's still not clear whether this is a function of the chip model or the available memory.
I would also be interested to know how well your M2 Pro performs in the CoreML benchmark results:
ggerganov/whisper.cpp#89
I'm considering upgrading to an M2 Pro/M2 Max if the performance is good enough, as both llama.cpp and whisper.cpp seem to support CoreML now, but I'm looking for solid numbers. The above link only has numbers for the M2.