How to run inference with llama-65b-4bit on multi-GPU
Minami-su opened this issue · 6 comments
How can I run inference with llama-65b-4bit on 2x 3090?
Put about 16 GB on one and 23 GB on the other and set auto-devices. Works in textgen for me.
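If you'd rather script it, here is a minimal sketch of the same per-GPU memory split using transformers/accelerate directly (roughly what the auto-devices setting does under the hood). The model path and the exact caps are placeholders; your 4-bit loader may differ:

```python
# Minimal sketch, assuming a transformers-compatible checkpoint and two GPUs.
# accelerate places layers across the cards so each stays under its cap.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/llama-65b"  # placeholder; adjust for your quantized checkpoint

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",                    # let accelerate split the layers
    max_memory={0: "16GiB", 1: "23GiB"},  # roughly the 16 GB / 23 GB split above
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
```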
I'm working on a repo to make this possible at speed. It's at a very early stage, but functioning.
You may want to have a look here: https://github.com/Dhaladom/TALIS
Right now with a P40 and a 3090 together I get 1.6 t/s and can do ~1300 tokens of context. Obviously Triton isn't going to cut it with a pre-Ampere GPU as the second half.
@Ph0rk0z I don't know much about the P40 and its potential, but there is a good chance you could improve your throughput significantly by optimizing memory usage and avoiding unnecessary data shuffling between the cards.
Applying this patch increased my throughput 5-fold, especially on longer sequences. The patch is also integrated into the repo I posted above.
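For reference, the core idea behind avoiding the shuffling is to keep each layer's compute on the GPU that holds its weights and move the hidden state only once per device boundary. A rough sketch of that idea (not the actual patch; the function name is hypothetical):

```python
# Rough illustration only, not the actual patch: with layers split across
# two GPUs, move the hidden state to the next layer's device once per
# boundary instead of bouncing activations back to GPU 0 after every layer.
import torch

def forward_split(layers, hidden):
    # `layers`: transformer blocks already placed on their respective GPUs
    for layer in layers:
        dev = next(layer.parameters()).device
        if hidden.device != dev:
            hidden = hidden.to(dev, non_blocking=True)  # one transfer per boundary
        hidden = layer(hidden)
    return hidden
```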
Other than that, are you running a GUI off the same GPU you are running the models on? I don't know by how much, but it will certainly reduce your maximum sequence length.
No, I have a GPU server, but one of my cards doesn't support Triton at all. I have to use the CUDA kernels for both.
Ok, thanks for pointing that out. It wasn't clear to me that Triton requires Ampere or newer GPUs.
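For anyone else who hits this, a quick way to check which of your cards can use the Triton kernels (Ampere corresponds to compute capability 8.0 or newer):

```python
# Quick check: Triton kernels need Ampere or newer (compute capability >= 8.0).
# A P40 reports (6, 1); a 3090 reports (8, 6).
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    kernels = "triton or cuda" if major >= 8 else "cuda only"
    print(f"GPU {i}: {name} (sm_{major}{minor}) -> {kernels}")
```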