nomic-ai/pygpt4all

Very slow generation with gpt4all

imwide opened this issue · 16 comments

imwide commented

Using gpt4all through the file shown in the attached screenshot works really well and is very fast, even though I am running on a laptop with Linux Mint, at about 0.2 seconds per token. But when running gpt4all through pyllamacpp, it takes up to 10 seconds to generate one token. Why is that, and how do I speed it up?

Having the same problem over here. Mac M1, 8 GB RAM. The chat binary works really fast, like in the gif in the README, but pyllamacpp is painfully slow. The output is also very different, with lower quality. Might it have to do with the new ggml weights (#40)?

I tried both the directly downloaded gpt4all-lora-quantized-ggml.bin and the result of converting gpt4all-lora-quantized.bin myself.

The gpt4all binary is based on an old commit of llama.cpp, so you might get different results when running pyllamacpp.

It might be that you need to build the package yourself, because the build process takes the target CPU into account. Or, as @clauslang said, it might be related to the new ggml format; people are reporting similar issues there.

So, what you have to do is build llama.cpp and compare it to pyllamacpp. If they run at the same speed, the problem is probably related to the new format.
If llama.cpp runs at normal speed, please try to build pyllamacpp as described in the README and let us know if that solves the issue.

Thanks @abdeladim-s. Not 100% sure if that's what you mean by building llama.cpp, but here's what I tried:

  • Ran the chat version of gpt4all (like in the README) --> works as expected: fast and fairly good output
  • Built and ran the chat version of alpaca.cpp (like in the README) --> works as expected: fast and fairly good output

Then I tried building pyllamacpp (like in the README):

git clone --recursive https://github.com/nomic-ai/pyllamacpp && cd pyllamacpp
pip install .

and ran the sample script:

from pyllamacpp.model import Model

# Stream each piece of generated text to stdout as it arrives.
def new_text_callback(text: str):
    print(text, end="", flush=True)

# Load the quantized ggml weights with a 512-token context window.
model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

# Generate up to 55 tokens from the prompt, streaming through the callback.
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback, n_threads=8)

--> very slow, and the output is either empty or poor
So building from source does not seem to solve the issue for me.
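
To put a number on "very slow", here is the same sample script with a rough timer around the generate call. This is only a sketch: it treats each callback invocation as roughly one token, which is close enough for comparing against the speed of the chat binary.

import time

from pyllamacpp.model import Model

token_count = 0

def timed_callback(text: str):
    """Print the streamed text and count callback invocations (~tokens)."""
    global token_count
    token_count += 1
    print(text, end="", flush=True)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

start = time.perf_counter()
model.generate("Once upon a time, ", n_predict=55,
               new_text_callback=timed_callback, n_threads=8)
elapsed = time.perf_counter() - start

print(f"\n{token_count} tokens in {elapsed:.1f}s "
      f"(~{elapsed / max(token_count, 1):.2f}s per token)")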

imwide commented

I have found that decreasing the threads from the default of 8 to 1 doubles the generation speed. No idea why, but it seems to work. I am trying to get it to go even faster. I'll let you know if I have updates.

Are you sure it doubles it? The threads parameter refers to CPU cores.

Batch size is the most important thing for speed; don't set it too high or too low.
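
For reference, llama.cpp calls this parameter n_batch (its old default is 8). Whether your installed pyllamacpp release forwards it through Model.generate depends on the version, so the sketch below only passes it if the signature accepts it; the keyword name n_batch is an assumption borrowed from llama.cpp, not a confirmed part of this package's API.

import inspect

from pyllamacpp.model import Model

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)

# Only pass n_batch if this pyllamacpp build actually accepts it; the keyword
# name is assumed from llama.cpp's gpt_params, not confirmed for this package.
kwargs = {}
if "n_batch" in inspect.signature(Model.generate).parameters:
    kwargs["n_batch"] = 8  # llama.cpp's old default; tune up or down from here

model.generate("Once upon a time, ", n_predict=55, n_threads=4,
               new_text_callback=lambda text: print(text, end="", flush=True),
               **kwargs)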

imwide commented

I found out what the relation is: the thread count can't be higher than the number of cores shown in the system info, otherwise generation becomes REALLY slow. I don't know why, since this seems easily preventable.
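
A simple way to stay on the safe side of that limit is to derive n_threads from what the OS reports instead of hard-coding 8. A minimal sketch, assuming the same Model API as the script above; note that os.cpu_count() reports logical cores, so halving it is a conservative starting point on hyperthreaded CPUs.

import os

from pyllamacpp.model import Model

# os.cpu_count() counts logical cores (hyperthreads included), so using half
# of it keeps the thread count at or below the physical core count on most CPUs.
logical_cores = os.cpu_count() or 1
n_threads = max(1, logical_cores // 2)

model = Model(ggml_model='./models/gpt4all-lora-quantized-ggml.bin', n_ctx=512)
model.generate("Once upon a time, ", n_predict=55, n_threads=n_threads,
               new_text_callback=lambda text: print(text, end="", flush=True))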

It's still as slow as a turtle.

@imwide, what I meant by building llama.cpp is to follow this.

Yes, increasing the number of threads causes some issues; I'm not sure why. Use 4 by default.

You are having a similar issue to this one; please go over it and let us know if you find any insights.

Increasing the thread count may cause it to include efficiency cores. For me, on an M1 Pro with 6 performance cores, changing from 8 threads to 6 fixed it.
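
On Apple Silicon you can ask macOS for the performance-core count directly and pass that as n_threads instead of guessing. A sketch, assuming the hw.perflevel0.physicalcpu sysctl key that Apple Silicon Macs expose; the fallback covers other systems.

import os
import subprocess

def performance_cores() -> int:
    """Number of performance cores on Apple Silicon; logical cores elsewhere."""
    try:
        out = subprocess.run(["sysctl", "-n", "hw.perflevel0.physicalcpu"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return os.cpu_count() or 1

print(performance_cores())  # e.g. 6 on an M1 Pro, so pass n_threads=6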

Hi @mattorp,
Is pyllamacpp working on your Mac M1? Could you please help solve issue #57?

Works fine on mine @abdeladim-s, so I'm not of much help for that issue. But hopefully @shivam-singhai's response indicates that the package manager version is the culprit.

No problem @mattorp.
Yes, @shivam-singhai's response seems to be the solution to that problem.
Thanks :)

I am having the same problem. gpt4all-lora-quantized-OSX-m1 is very fast (< 1 sec) on my Mac, but running with pyllamacpp is very slow: typical queries take > 30 sec. I tried the couple of things suggested above, but that didn't change the response time.

@bsbhaskartp, if it is slow then you just need to build it from source.
@Naugustogi was having the same issue and succeeded in solving it. Please take a look at this issue; it might help.