Running models above Q3 for higher quality output
sussyboiiii opened this issue · 14 comments
Hello,
This application is exactly what I've been searching for for a while now: simple yet useful, with features like chat saving etc. implemented, and not running as a web UI. But the problem I've encountered is that I can't run a model higher than Q3, which is a bummer because running Q5 or higher for better output on higher-end systems would be super cool.
I hope you can change this to make this project even cooler!
Thanks!
Thanks for trying FreeChat, glad it's useful for you. I typically run a Q8 and have run Q4, Q5, and Q2. Can you link me to the exact model you're trying to run? Also some specs for your machine would be useful to help debug (processor and RAM).
Hello, I forgot: I'm on macOS 14.2 on a MacBook Pro M1 Pro with 16GB. I've tried to run WizardLM v1.2 GGUF from TheBloke and Wizard-Vicuna from TheBloke as well. I've tried Q5_K_M because it's the best quality-to-performance for my machine, but the application crashes unless I use Q3 like the recommended model. To note, both models work using llama.cpp.
I have tried a few different models, Q8_0 as well, but they all led to a crash of the application. If you want, I can upload the Apple crash report.
Thanks for the info, and uploading the crash report would be great. Are you on the latest TestFlight or the app store build? If you haven't, try the latest TestFlight as it fixes some crash behavior I found.
However, it looks like you are right at the edge of the model size your computer can support. According to things I've read like this, macOS limits VRAM allocation to roughly 75% of RAM.
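For what it's worth, Metal exposes that cap directly as recommendedMaxWorkingSetSize. Here's a quick standalone sketch (not FreeChat code, just an illustration) you could run to see the number on your machine; on a 16GB Apple Silicon machine it typically lands around 10-11 GiB:

```swift
import Foundation
import Metal

// Ask Metal how much memory the GPU is expected to keep resident.
// llama.cpp compares its Metal allocations against this same value.
if let device = MTLCreateSystemDefaultDevice() {
    let maxWorkingSet = device.recommendedMaxWorkingSetSize // in bytes
    let gib = Double(maxWorkingSet) / 1_073_741_824
    print(String(format: "recommendedMaxWorkingSetSize: %.2f GiB", gib))
}
```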
I have downloaded the beta and it still crashes. Running the model in llama.cpp in the terminal works, though. If this limit is truly the problem, then Apple has introduced something that makes your system worse, because on macOS 12 or 13 (I don't remember which) I ran an LLM that used 45GB on my Mac.
Anyway, here's the log:
applelogbeta.txt
if this limit is truly the problem then Apple has introduced something that makes your system worse
I have discovered this in llama.cpp when trying to use Q8_0 on another model with GPU on:
(13758,27 / 10922,67) ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
If I turn the GPU off it runs without crashing, but painfully slow: (10875,73 ms per token, 0,09 tokens per second)
So this may be Apple taking the piss again, but I still don't get why Q5_K_M won't run, as it is below 10GB.
Ah yes, that must be it because FreeChat uses llama with GPU on. Perhaps I could do this conditionally depending on RAM and model size but it sounds like the GPU off mode is not very useful anyway at those speeds.
Does the Q5_K_M work in plain llama.cpp with GPU on? What other flags are you running with llama? The latest FreeChat defaults are here, which may use more resources than how you're running it: https://github.com/psugihara/FreeChat/blob/main/mac/FreeChat/Models/NPC/LlamaServer.swift#L320
FreeChat basically just boots llama.cpp's server binary and passes those arguments on each request. We could clearly use some better diagnostics when errors happen, and I'd like to try capturing them in an upcoming release. Thanks for all the info you're providing here.
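To sketch the conditional GPU idea (purely hypothetical helper, not anything in FreeChat today; the threshold and the -1/0 layer convention are assumptions), it could look roughly like this:

```swift
import Foundation
import Metal

// Hypothetical heuristic: only offload to the GPU if the model file, plus some
// headroom for the KV cache and scratch buffers, fits within Metal's
// recommended max working set size.
func gpuLayersToOffload(modelPath: String, headroomBytes: UInt64 = 2_000_000_000) -> Int {
    guard
        let device = MTLCreateSystemDefaultDevice(),
        let attrs = try? FileManager.default.attributesOfItem(atPath: modelPath),
        let modelBytes = (attrs[.size] as? NSNumber)?.uint64Value
    else {
        return 0 // can't tell, so stay on the CPU
    }
    let budget = device.recommendedMaxWorkingSetSize
    // -1 is used here to mean "offload everything" (as with llama.cpp's -ngl),
    // 0 to mean "keep everything on the CPU".
    return modelBytes + headroomBytes <= budget ? -1 : 0
}
```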
Thank you for your support!
Here is the command I use to test. It definitely isn't the most optimized for anything, and it may also be outdated since I made it over a year ago:
./main -ngl -1 -m ./models/german/em_german_13b_v01.Q5_K_M.gguf -n 2048 -t 6 --repeat_penalty 1.1 --color -i -r "USER:" -f ./prompts/german.txt
I forgot to say that GPT4All runs the 13B models that crash FreeChat, but it also crashes once VRAM usage surpasses 10GB on my system.
If possible, adding something like try/catch so the whole application doesn't crash and instead spits out an error message would also be nice.
something like try/catch so the whole application doesn't crash and instead spits out an error message
Yep, that's what I meant I'd like to try in an upcoming release. Hopefully it will be clearer what's causing the crash if we can get that going.
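Roughly the shape I have in mind (all names here are placeholders, not FreeChat's actual types), just so the failure turns into a message instead of taking the app down:

```swift
import Foundation

// Placeholder error for the case where the llama.cpp server process dies.
enum LlamaServerError: Error {
    case serverDied(exitCode: Int32)
}

// Placeholder for whatever actually talks to the llama.cpp server.
func complete(prompt: String) async throws -> String {
    throw LlamaServerError.serverDied(exitCode: 1) // simulate the crash case
}

func requestCompletion(_ prompt: String) async -> String {
    do {
        return try await complete(prompt: prompt)
    } catch LlamaServerError.serverDied(let code) {
        // Surface the failure to the user instead of crashing.
        return "llama.cpp server exited with code \(code); try a smaller model or context size."
    } catch {
        return "Completion failed: \(error.localizedDescription)"
    }
}
```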
Alright, thank you!
If I can test something for you please just message me here!
I discovered the solution to the problem: thanks to the slider for adjusting the context length in the beta, the model now runs. A context length above 1024 for a 13B Q5_K_M model will surpass the 10GB limit and crash.
ctx of 1024:
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 800,03 MiB, ( 9732,20 / 10922,67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 112,02 MiB, ( 9844,22 / 10922,67)
ctx of 2048 (already crashes):
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1600,03 MiB, (10532,20 / 10922,67)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 194,02 MiB, (10726,22 / 10922,67)
ctx of 4096:
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 3200,03 MiB, (12132,20 / 10922,67)
ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 358,02 MiB, (12490,22 / 10922,67)
ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
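That lines up with how the KV cache scales: it grows linearly with context length, so doubling the context doubles the 'kv' buffer. A rough back-of-the-envelope check, assuming the usual 13B Llama shape (40 layers, 5120-dim embeddings, f16 K/V):

```swift
// Estimated KV cache size for a Llama-style 13B model:
// 2 tensors (K and V) x layers x context x embedding dim x 2 bytes (f16).
func kvCacheMiB(contextLength: Int, layers: Int = 40, embeddingDim: Int = 5120) -> Double {
    let bytes = 2 * layers * contextLength * embeddingDim * 2
    return Double(bytes) / 1_048_576
}

print(kvCacheMiB(contextLength: 1024)) // 800.0  -> matches the "800,03 MiB" line
print(kvCacheMiB(contextLength: 2048)) // 1600.0 -> pushes the total past ~10922 MiB
print(kvCacheMiB(contextLength: 4096)) // 3200.0
```

So at a context of 1024 the cache alone is ~800 MiB and the total stays under the ~10922 MiB working-set limit, while 2048 adds another ~800 MiB and tips it over.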
These are just some things that may be nice to add in the future; you may have planned some of them already.
Thank you for creating this gui!
- Add a button to regenerate responses or stop the response
- Make the advanced tab more advanced (including all available options for a specific model)
- Allow saving of custom system prompts
- Make settings and prompts model-specific, so they automatically adjust when switching between models
- Add a button to switch between GPU and CPU (using only the CPU lets you use more memory than your machine has RAM, just slower; with -t -1 and -c 4096 it now works, slower than the GPU but still respectable, and even a context size of 32768 worked, using 25GB of RAM. Slow, but cool to play around with)
thanks! I like all of these suggestions and will consider how to add them. Glad you got a workaround with context length!