qwopqwop200/GPTQ-for-LLaMa

How to run inference with llama-65b-4bit on multiple GPUs

Minami-su opened this issue · 6 comments

How do I run inference with llama-65b-4bit on 2x 3090?

Put about 16 GB on one card and 23 GB on the other, and enable auto devices. That works in textgen for me.
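For anyone loading the model from a script rather than through textgen, the split above can be written as a per-device memory cap. This is only a sketch: `build_max_memory` is a hypothetical helper, and the `{device_index: "NGiB"}` shape is the one Hugging Face Accelerate accepts for its `max_memory` argument. Whether your loader takes the same mapping is an assumption you should check, as is treating the 16/23 figures as GiB.

```python
def build_max_memory(gib_per_gpu, cpu_gib=None):
    """Build a per-device memory cap mapping, one entry per GPU index.

    gib_per_gpu: list of caps in GiB, in GPU index order (hypothetical helper).
    cpu_gib: optional cap for CPU offload; omitted when None.
    """
    max_memory = {i: f"{gib}GiB" for i, gib in enumerate(gib_per_gpu)}
    if cpu_gib is not None:
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


# The split suggested in this thread: about 16 on GPU 0 and 23 on GPU 1.
caps = build_max_memory([16, 23])
print(caps)
```

The resulting dictionary would typically be passed along with an automatic device map so that layers are distributed across both cards within those limits.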

I'm working on a repo to make this possible at a usable speed. It's in very early stages, but functioning.

You may want to have a look here: https://github.com/Dhaladom/TALIS

Right now, with a P40 and a 3090 together, I get 1.6 t/s and can fit ~1300 tokens of context. Obviously Triton isn't going to cut it with a pre-Ampere GPU as the second half.

@Ph0rk0z I don't know much about the P40 and its potential, but there is a good chance you can improve your throughput significantly by optimizing memory usage and avoiding unnecessary data shuffling between the cards.

Applying this patch increased my throughput 5-fold, especially on longer sequences. The patch is also integrated in the repo I posted above.

Other than that, are you running a GUI off the same GPU you are running the models on? I don't know by how much, but it will certainly reduce your maximum sequence length.

No, I have a GPU server, but one of my cards doesn't support Triton at all, so I have to use CUDA for both.

OK, thanks for pointing that out. It wasn't clear to me that Triton requires Ampere or newer GPUs.
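The backend choice discussed here can be sketched as a small check on each card's CUDA compute capability. `pick_backend` is a hypothetical helper, and the "Ampere or newer" threshold is taken from this thread rather than verified against Triton's own requirements. The capability tuples are hard-coded for illustration. With PyTorch installed they could be read with `torch.cuda.get_device_capability(i)`, which returns a `(major, minor)` tuple.

```python
AMPERE_MAJOR = 8  # compute capability 8.x is Ampere (an RTX 3090 reports 8.6)


def pick_backend(capabilities):
    """Return "triton" only if every GPU is Ampere or newer, else "cuda".

    capabilities: list of (major, minor) compute capability tuples, one per GPU.
    The Ampere threshold is an assumption based on this thread.
    """
    if capabilities and all(major >= AMPERE_MAJOR for major, _minor in capabilities):
        return "triton"
    return "cuda"


print(pick_backend([(8, 6), (8, 6)]))  # two RTX 3090s
print(pick_backend([(6, 1), (8, 6)]))  # Tesla P40 (Pascal, 6.1) plus RTX 3090
```

Because a model split across cards runs layers on every device, a single unsupported card forces the fallback for the whole setup, which matches the P40 plus 3090 case described above.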