AlexBuz/llama-zip

Using different models like Phi-3

CyberTimon opened this issue · 7 comments

Hello!

Does using smaller models (but with bigger context windows) like Phi-3 result in a worse compression ratio?
Have you already tested this?

Thanks

Short update: I tried Phi-3-Mini, and it leaked memory and completely crashed my M1 Max 64GB MacBook Pro. I'm investigating further.

I just tried Phi-3-Mini-128k and the same thing happened to me (same machine). I'm looking into it now as well.

In parallel, I'm also wondering whether a bigger model would give significantly stronger compression ratios.

As an update on this, it appears the crashing on the M1 Max 64GB results from a combination of these factors:

  • Offloading too many of the model's layers to the GPU
  • Running the model with too large of a context
  • Having mlock enabled

I've now pushed an update that disables mlock by default. However, even with mlock disabled, running Phi-3-Mini with a 128k context and all layers offloaded to the GPU is still very slow, and in fact nondeterministic, on my machine; since compression requires deterministic model outputs to be reversible, that makes it unusable here. Disabling GPU offloading does solve this, but I've also added an option to specify a different (smaller) context length as an alternative solution.
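For anyone who wants to reproduce the workaround outside of llama-zip, the relevant knobs correspond to parameters of llama-cpp-python's `Llama` constructor. Here's a rough sketch with placeholder values (the model path and context size are examples, not llama-zip's actual invocation or defaults):

```python
# Rough sketch using llama-cpp-python; the model path and context size
# are placeholders, not llama-zip's defaults.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3-mini-128k-instruct.Q8_0.gguf",  # placeholder path
    n_ctx=8192,        # reduced context instead of the full 128k
    n_gpu_layers=0,    # disable GPU offloading to avoid the slowdown/nondeterminism
    use_mlock=False,   # mlock off (now also the default)
    verbose=False,
)
```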

As for the compression ratio of a smaller model with a longer context, stay tuned. I plan to add Phi-3 to the table soon. I'm not sure if I will be able to test any larger models though, as it would take an unreasonable amount of time to run them through my benchmark on my machine.
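For context on what the benchmark measures, it boils down to total compressed size relative to total original size over a corpus, where every token of the input requires a forward pass through the model, which is why larger models take so long to evaluate. A minimal sketch of that loop (the `compress` function below is a hypothetical stand-in for invoking llama-zip, not its actual API):

```python
# Hedged sketch of a compression-ratio benchmark loop; `compress` is a
# hypothetical stand-in for a llama-zip compression call, not its real API.
from pathlib import Path

def compress(data: bytes) -> bytes:
    raise NotImplementedError("stand-in for running llama-zip on one input")

def benchmark(corpus_dir: str) -> float:
    original_bytes = 0
    compressed_bytes = 0
    for path in sorted(Path(corpus_dir).glob("*")):
        data = path.read_bytes()
        original_bytes += len(data)
        compressed_bytes += len(compress(data))
    # Assumed convention: compressed size over original size (smaller is better).
    return compressed_bytes / original_bytes
```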

Phi-3 is now added! Looks like it significantly outperforms Llama 3 on compressing code, and does very well on non-code as well.

Great to hear. Will test it out. Thank you!