Support the new `mmap`-able ggml format
philpax opened this issue · 12 comments
Justine's managed to land her mad-lass mmap-optimised format into llama.cpp. We should support loading this format in - and if we're smart about it, we should also support `mmap`ing it in. This should hopefully be easier to do in Rust than it is in C++!
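For illustration, here's a minimal sketch of what the loading side could look like with the `memmap2` crate (the path and function name are made up, not our actual API):

```rust
// Sketch only: assumes the `memmap2` crate; error handling kept minimal.
use std::fs::File;
use memmap2::Mmap;

fn map_model_file(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: read-only mapping; we assume the file isn't modified while
    // the model is loaded.
    unsafe { Mmap::map(&file) }
}

fn main() -> std::io::Result<()> {
    // Hypothetical path; the real loader would take this from the CLI.
    let bytes = map_model_file("models/7B/ggml-model-q4_0.bin")?;
    // `&bytes[..]` is a &[u8] backed by the page cache, so tensor data can be
    // referenced in place instead of being copied into the heap.
    println!("mapped {} bytes", bytes.len());
    Ok(())
}
```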
@philpax would you be interested if I added proper support in huggingface/safetensors#197?
The format allows for `mmap` (or not, but it will currently align buffers for zero-copy loads). But it is a "pure" Rust format (which is also readable from Python).
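Roughly, the zero-copy path would look something like this on the Rust side (a sketch only; the file and tensor names here are just examples, and the exact API is worth double-checking against the safetensors crate itself):

```rust
// Sketch: assumes the `safetensors` and `memmap2` crates.
use std::fs::File;
use memmap2::Mmap;
use safetensors::SafeTensors;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file name.
    let file = File::open("model.safetensors")?;
    // SAFETY: read-only mapping of a file we assume isn't modified underneath us.
    let mmap = unsafe { Mmap::map(&file)? };

    // Deserialization only parses the header; tensor data stays in the mapped
    // region, so nothing is copied here.
    let tensors = SafeTensors::deserialize(&mmap)?;

    // "tok_embeddings.weight" is just an example name.
    let view = tensors.tensor("tok_embeddings.weight")?;
    println!("dtype: {:?}, shape: {:?}, {} bytes",
             view.dtype(), view.shape(), view.data().len());
    Ok(())
}
```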
Hm, don't see why not! It'll depend on model availability, but I imagine we'll start seeing models released in ST format.
FYI: note for mmap
https://justine.lol/mmap/
Seems like some crazy misinformation to me. I've never even seen a multipart GGML file. The whole dataset is also needed for processing each token, so you can't practically use models larger than memory, because that would mean repeatedly re-reading the data from disk.
"100x faster" โ maybe if the entire thing is already in the buffer cache, but that's only possible when you have enough memory available to load the whole model.
mmap is great and some overhead can be avoided (sometimes) when using it, but it's not magic.
Also, does anyone know exactly how the file format changed? Specifically, what's different between the previous version and the current one? Looking at the conversion script isn't that helpful since it just rewrites everything.
> I've never even seen a multipart GGML file.
Converting any of the multipart .pths will result in multipart GGMLs.
"100x faster" โ maybe if the entire thing is already in the buffer cache, but that's only possible when you have enough memory available to load the whole model.
Yes, I think the primary benefit is in repeated executions so that the model remains resident across executions.
> Also, does anyone know exactly how the file format changed?
My understanding is that it's laying out the data so that it matches the layout in memory. See #114.
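If I'm reading it right, the practical upshot is that tensor data now starts at aligned offsets, so the mapped bytes can be used in place. Something along these lines, though the 32-byte figure is my assumption rather than something I've verified:

```rust
// Sketch of aligning tensor data when writing the file so that a later mmap
// can hand out properly aligned slices. The 32-byte alignment is an assumption.
const ALIGNMENT: u64 = 32;

/// Padding bytes needed so `offset` lands on an ALIGNMENT boundary.
fn padding_for(offset: u64) -> u64 {
    (ALIGNMENT - offset % ALIGNMENT) % ALIGNMENT
}

fn main() {
    // E.g. the write position is 1234 bytes in after a tensor header; we'd
    // emit 14 zero bytes so the tensor data starts at 1248 (a multiple of 32).
    let offset = 1234u64;
    let pad = padding_for(offset);
    assert_eq!((offset + pad) % ALIGNMENT, 0);
    println!("pad {} bytes, data starts at {}", pad, offset + pad);
}
```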
> Converting any of the multipart .pths will result in multipart GGMLs.
I see. It may make a bigger difference in that case, but they also could have just changed the conversion process to make stuff contiguous without having to mess with the final file format.
> Yes, I think the primary benefit is in repeated executions so that the model remains resident across executions.
Well, it's just the OS buffer cache. The OS will cache stuff whether you're using mmap or just reading it normally. mmapping may avoid copying some data, but loading the model was already very fast when the cache was hot.
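To make the distinction concrete, here's a rough comparison (the path is hypothetical; both routes go through the OS page cache, the difference is mainly the extra copy):

```rust
// Rough sketch comparing the two loading paths; assumes the `memmap2` crate.
use std::fs::File;
use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let path = "models/7B/ggml-model-q4_0.bin"; // hypothetical path

    // Path 1: read() - the kernel copies the (possibly already cached) pages
    // into this Vec, so the data ends up in memory twice while the cache is warm.
    let copied: Vec<u8> = std::fs::read(path)?;

    // Path 2: mmap - the mapping points at the page-cache pages themselves,
    // so there is no second copy and repeated runs reuse the cached pages.
    let file = File::open(path)?;
    let mapped = unsafe { Mmap::map(&file)? };

    assert_eq!(copied.len(), mapped.len());
    Ok(())
}
```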
> My understanding is that it's laying out the data so that it matches the layout in memory. See #114.
Nice, someone else is dealing with it! Although, right now the new format part is just a `todo!("new format here")`, so that doesn't really help with understanding the change at present.
> It may make a bigger difference in that case, but they also could have just changed the conversion process to make stuff contiguous without having to mess with the final file format.
I think that's what they did - they just discovered that you can't go all the way without changing the format. The post has some details on this (something about the tensors being interleaved), but I just skimmed over it (I'm out right now).
> Well, it's just the OS buffer cache. The OS will cache stuff whether you're using mmap or just reading it normally. mmapping may avoid copying some data, but loading the model was already very fast when the cache was hot.
Yeah. The main benefit is that you aren't pointlessly copying memory, which means the memory traffic is much lower (the cached pages can be used without needing to copy to other pages).
Speed is also relative: it's slower on Windows than it is on an M1 Mac.
> Although, right now the new format part is just a `todo!("new format here")` so that doesn't really help with understanding the change at present.
I think that's been addressed now.
> I think that's what they did - they just discovered that you can't go all the way without changing the format.
That doesn't make sense to me, since there were already 13B and 30B parameter single file GGML models. So the format had to be able to handle that. If multipart models got converted in a way that made dealing with them inefficient, it could have been changed on the converter side.
> The main benefit is that you aren't pointlessly copying memory, which means the memory traffic is much lower
It is something that just happens once at startup though. I don't notice a huge difference in load times between llama-rs and llama.cpp and it only took a couple seconds even for a 13B model.
> I think that's been addressed now
Haha, my comment is a whole 15 minutes out of date it seems!
That's amazing though, nice work @iacore!
Yeah, I don't know. I haven't been following it that closely - I've just been trying to figure out what we need to get it working here.