nomic-ai/pygpt4all

Very slow generation with gpt4all

imwide opened this issue · 16 comments

imwide commented

Using gpt4all through the file shown in the attached screenshot works really well and is very fast, even though I am running on a laptop with Linux Mint, at about 0.2 seconds per token. But when running gpt4all through pyllamacpp, it takes up to 10 seconds to generate one token. Why is that, and how do I speed it up?

Having the same problem over here. Mac M1, 8 GB RAM. The chat binary works really fast, like in the gif in the README, but pyllamacpp is painfully slow. The output is also very different, with lower quality. Might it have to do with the new ggml weights (#40)?

I tried both the directly downloaded gpt4all-lora-quantized-ggml.bin and the result of converting gpt4all-lora-quantized.bin myself.

The gpt4all binary is based on an old commit of llama.cpp, so you might get different results when running pyllamacpp.

It might be that you need to build the package yourself, because the build process takes the target CPU into account. Or, as @clauslang said, it might be related to the new ggml format; people are reporting similar issues there.

So, what you have to do is build llama.cpp and compare it to pyllamacpp. If they run at the same speed, the problem is probably related to the new format.
If llama.cpp runs at normal speed, please try to build pyllamacpp as described in the README and let us know if that solves the issue.

Thanks @abdeladim-s. Not 100% sure if that's what you mean by building llama.cpp, but here's what I tried:

  • Ran the chat version of gpt4all (like in the README) --> works as expected: fast and fairly good output
  • Built and ran the chat version of alpaca.cpp (like in the README) --> works as expected: fast and fairly good output

Then I tried building pyllamacpp (like in the README):

git clone --recursive https://github.com/nomic-ai/pyllamacpp && cd pyllamacpp
pip install .

and ran the sample script:

from pyllamacpp.model import Model

# Stream each piece of generated text to stdout as it arrives.
def new_text_callback(text: str):
    print(text, end="", flush=True)

# Load the quantized ggml weights with a 512-token context window.
model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

# Generate up to 55 tokens from the prompt, streaming through the callback.
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback, n_threads=8)

--> very slow, and the output is either empty or poor
So building from source does not seem to solve the issue for me.
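
To put a number on "very slow", here is the same sample script with a rough timer around the generate call. This is only a sketch: it treats each callback invocation as roughly one token, which is close enough for comparing against the speed of the chat binary.

import time

from pyllamacpp.model import Model

token_count = 0

def timed_callback(text: str):
    """Print the streamed text and count callback invocations (~tokens)."""
    global token_count
    token_count += 1
    print(text, end="", flush=True)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

start = time.perf_counter()
model.generate("Once upon a time, ", n_predict=55,
               new_text_callback=timed_callback, n_threads=8)
elapsed = time.perf_counter() - start

print(f"\n{token_count} tokens in {elapsed:.1f}s "
      f"(~{elapsed / max(token_count, 1):.2f}s per token)")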

imwide commented

I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

Are you sure it doubles it? The threads parameter refers to CPU cores.

Batch size is the most important thing for speed; don't set it too high or too low.
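
For reference, llama.cpp calls this parameter n_batch (its old default is 8). Whether your installed pyllamacpp release forwards it through Model.generate depends on the version, so the sketch below only passes it if the signature accepts it; the keyword name n_batch is an assumption borrowed from llama.cpp, not a confirmed part of this package's API.

import inspect

from pyllamacpp.model import Model

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

# Only pass n_batch if this pyllamacpp build actually accepts it; the keyword
# name is assumed from llama.cpp's gpt_params, not confirmed for this package.
kwargs = {}
if "n_batch" in inspect.signature(Model.generate).parameters:
    kwargs["n_batch"] = 8  # llama.cpp's old default; tune up or down from here

model.generate("Once upon a time, ", n_predict=55, n_threads=4,
               new_text_callback=lambda text: print(text, end="", flush=True),
               **kwargs)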

imwide commented

I found out what the relation is: the thread count can't be higher than the number of cores shown in the system info, otherwise generation becomes REALLY slow. I don't know why, since this seems easily preventable.
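
A simple way to stay on the safe side of that limit is to derive n_threads from what the OS reports instead of hard-coding 8. A minimal sketch, assuming the same Model API as the script above; note that os.cpu_count() reports logical cores, so halving it is a conservative starting point on hyperthreaded CPUs.

import os

from pyllamacpp.model import Model

# os.cpu_count() counts logical cores (hyperthreads included), so using half
# of it keeps the thread count at or below the physical core count on most CPUs.
logical_cores = os.cpu_count() or 1
n_threads = max(1, logical_cores // 2)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)
model.generate("Once upon a time, ", n_predict=55, n_threads=n_threads,
               new_text_callback=lambda text: print(text, end="", flush=True))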

It's still as slow as a turtle.

@imwide, what I meant by building llama.cpp is to follow this.

Yes, increasing the number of threads causes some issues; I'm not sure why. Use 4 by default.

You are having a similar issue to this one; please go over it and let us know if you find any insights.

Increasing the thread count may cause it to include efficiency cores. For me, on an M1 Pro with 6 performance cores, changing from 8 threads to 6 fixed it.
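
On Apple Silicon you can ask macOS for the performance-core count directly and pass that as n_threads instead of guessing. A sketch, assuming the hw.perflevel0.physicalcpu sysctl key that Apple Silicon Macs expose; the fallback covers other systems.

import os
import subprocess

def performance_cores() -> int:
    """Number of performance cores on Apple Silicon; logical cores elsewhere."""
    try:
        out = subprocess.run(["sysctl", "-n", "hw.perflevel0.physicalcpu"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return os.cpu_count() or 1

print(performance_cores())  # e.g. 6 on an M1 Pro, so pass n_threads=6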

Hi @mattorp,
Is pyllamacpp working on your Mac M1? Could you please help solve issue #57?

Works fine on mine @abdeladim-s, so I'm not of much help for that issue. But hopefully @shivam-singhai's response indicates that the package manager version is the culprit.

No problem @mattorp.
Yes, @shivam-singhai's response seems to be the solution to that problem.
Thanks :)

I am having the same problem. gpt4all-lora-quantized-OSX-m1 is very fast (< 1 sec) on my Mac, but running with pyllamacpp is very slow: typical queries take > 30 sec. I tried the couple of things suggested above, but that didn't change the response time.

@bsbhaskartp, if it is slow then you just need to build it from source.
@Naugustogi was having the same issue and succeeded in solving it. Please take a look at this issue; it might help.