tloen/llama-int8

Tracking issue for Mac support

pannous opened this issue · 3 comments

M1 / M2 with 32GB … 128GB of RAM: any hopes?

No luck with this repo; the "bitsandbytes" dependency relies heavily on CUDA.
But there is a repo for CPU inference; just change prompts to prompts[0] so it doesn't crash with max_batch_size=1, as sketched below.
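
Roughly, the change looks like this. This is a hedged sketch, not the exact patch: it assumes a script shaped like llama's example.py, where a load() helper has already built generator and prompts is a list of strings (those names come from that example, not from this repo):

```python
# Sketch of the single-prompt workaround, assuming a script shaped like
# llama's example.py, where `generator` was already returned by load().
prompts = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that",
]

# The model was loaded with max_batch_size=1, so pass a one-element batch
# instead of the full list to avoid the batch-size crash:
results = generator.generate([prompts[0]], max_gen_len=20, temperature=0.8, top_p=0.95)
print(results[0])
```
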
It takes more than 10 minutes to produce output with max_gen_len=20; even GPT-J 6B took me around a minute on CPU.
I also tried to make an MPS port with GPU acceleration. It runs faster, but the output is not good enough IMO; I'm not sure whether the CPU output is always good or I just got lucky on my first generation. UPDATE: the model gives good outputs with Python 3.10 + pytorch-nightly.
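
For anyone trying the same, the core of an MPS port is just device selection; here's a minimal, self-contained sketch (not the actual port) that needs a recent PyTorch such as the nightly mentioned above:

```python
import torch

# Prefer Apple's Metal backend when PyTorch was built with MPS support
# (requires a recent PyTorch, e.g. the nightly); fall back to CPU otherwise.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Any loaded model and input tensors then just move over, e.g.:
x = torch.randn(1, 8, device=device)
layer = torch.nn.Linear(8, 8).to(device)
print(layer(x).device)  # "mps:0" on an M1/M2 with MPS, "cpu" otherwise
```
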

thanks!

Actually, I was wrong. After I tried my port with a newer version of Python + PyTorch, the outputs were as good as the CPU ones. I'm happy it worked after all!