How to run inference with llama-65b-4bit on multi-GPU
Minami-su opened this issue · 6 comments
How can I run inference with llama-65b-4bit on 2x 3090?
Put about 16 GB on one and 23 GB on the other and set auto-devices. Works in textgen for me.
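If you'd rather script it, here is a minimal sketch of the same per-GPU memory split using transformers/accelerate directly (roughly what the auto-devices setting does under the hood). The model path and the exact caps are placeholders; your 4-bit loader may differ:

```python
# Minimal sketch, assuming a transformers-compatible checkpoint and two GPUs.
# accelerate places layers across the cards so each stays under its cap.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-65b"  # placeholder; adjust for your quantized checkpoint

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",                    # let accelerate split the layers
    max_memory={0: "16GiB", 1: "23GiB"},  # roughly the 16 GB / 23 GB split above
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
```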
I'm working on a repo to make this possible at speed. It's at a very early stage, but functioning.
You may want to have a look here: https://github.com/Dhaladom/TALIS
Right now with a P40 and a 3090 together I get 1.6 t/s and can do ~1300 tokens of context. Obviously Triton isn't going to cut it with a pre-Ampere GPU as the second half.
@Ph0rk0z I don't know much about the P40 and its potential, but there is a good chance you could improve your throughput significantly by optimizing memory usage and avoiding unnecessary data shuffling between the cards.
Applying this patch increased my throughput 5-fold, especially on longer sequences. The patch is also integrated into the repo I posted above.
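For reference, the core idea behind avoiding the shuffling is to keep each layer's compute on the GPU that holds its weights and move the hidden state only once per device boundary. A rough sketch of that idea (not the actual patch; the function name is hypothetical):

```python
# Rough illustration only, not the actual patch: with layers split across
# two GPUs, move the hidden state to the next layer's device once per
# boundary instead of bouncing activations back to GPU 0 after every layer.
import torch

def forward_split(layers, hidden):
    # `layers`: transformer blocks already placed on their respective GPUs
    for layer in layers:
        dev = next(layer.parameters()).device
        if hidden.device != dev:
            hidden = hidden.to(dev, non_blocking=True)  # one transfer per boundary
        hidden = layer(hidden)
    return hidden
```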
Other than that, are you running a GUI off the same GPU you are running the models on? I don't know by how much, but it will certainly reduce your maximum sequence length.
No, I have a GPU server, but one of my cards doesn't support Triton at all. I have to use the CUDA kernels for both.
Ok, thanks for pointing that out. It wasn't clear to me that Triton requires Ampere or newer GPUs.
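For anyone else who hits this, a quick way to check which of your cards can use the Triton kernels (Ampere corresponds to compute capability 8.0 or newer):

```python
# Quick check: Triton kernels need Ampere or newer (compute capability >= 8.0).
# A P40 reports (6, 1); a 3090 reports (8, 6).
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    kernels = "triton or cuda" if major >= 8 else "cuda only"
    print(f"GPU {i}: {name} (sm_{major}{minor}) -> {kernels}")
```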