nomic-ai/pygpt4all

On Colab, UnicodeDecodeError

LeMoussel opened this issue · 9 comments

Hello, I am using pygpt4all on Colab, but I often encounter the following error:

---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

[<ipython-input-2-1545ea4aa0da>](https://localhost:8080/#) in <cell line: 8>()
      6 
      7 model = GPT4All_J('./ggml-gpt4all-j-v1.2-jazzy.bin')
----> 8 model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)

[/usr/local/lib/python3.9/dist-packages/pygptj/model.py](https://localhost:8080/#) in generate(self, prompt, new_text_callback, n_predict, seed, n_threads, top_k, top_p, temp, n_batch)
    122 
    123         # run the prediction
--> 124         pp.gptj_generate(self.gpt_params, self._model, self._vocab, self._call_new_text_callback)
    125         return self.res
    126 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte

Colab source code:

!pip install -q pygpt4all 
!wget https://gpt4all.io/models/ggml-gpt4all-j-v1.2-jazzy.bin
# https://github.com/nomic-ai/pygpt4all
from pygpt4all.models.gpt4all_j import GPT4All_J

def new_text_callback(text):
    print(text, end="", flush=True)

model = GPT4All_J('./ggml-gpt4all-j-v1.2-jazzy.bin')
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)
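For anyone reading the traceback above: the error message itself can be reproduced in plain Python, without pygptj. Byte 0xad is a UTF-8 continuation byte, so it cannot start a character, and decoding it on its own raises exactly this message. The sketch below assumes (the thread does not confirm this) that generated text reaches Python as raw byte chunks that can split or corrupt a multi-byte character. `decode_chunks` is a hypothetical helper for illustration, not part of pygptj or pygpt4all, and not necessarily how the library handles the problem.

```python
import codecs


def decode_chunks(chunks):
    """Decode an iterable of byte chunks incrementally, yielding text pieces.

    Hypothetical helper: incomplete multi-byte sequences are buffered until
    the next chunk arrives, and invalid bytes become U+FFFD instead of raising.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush anything still buffered at the end of the stream.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Reproduce the message from the traceback with a lone continuation byte.
try:
    b"\xad".decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# "é" is encoded as b"\xc3\xa9"; here it is split across two chunks.
for piece in decode_chunks([b"caf", b"\xc3", b"\xa9"]):
    print(piece, end="", flush=True)
print()
```

Decoding each chunk independently with `bytes.decode` would fail on the second chunk here, while the incremental decoder waits for the rest of the character.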

I am getting this issue as well on an Ubuntu 22.04 VirtualBox VM. I am just trying to run the code example in the README file.

Also getting this on Ubuntu 20.04 (WSL2) using Python 3.11.2. I tried a fresh venv and installed the packages below, but the problem persists.
pygpt4all 1.0.1
pygptj 1.0.5

I'm experiencing this as well on Ubuntu 22.04, Python 3.10.9.

I too was just trying to run the code example from the README. I got some additional errors about unknown tokens as well (see the following additional output):

---------------------------------------------------------------------------
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)
gptj_generate: seed = 1682362796
gptj_generate: number of tokens in prompt = 3
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'a'
gpt_tokenize: unknown token 'm'
gpt_tokenize: unknown token 'e'
gpt_tokenize: unknown token ','
gpt_tokenize: unknown token ' '
Once upon tiTraceback (most recent call last):
  File "/opt/python/mambaforge/envs/cbot4me/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3505, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-0c9a6b32721a>", line 1, in <module>
    model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)
  File "/opt/python/mambaforge/envs/cbot4me/lib/python3.10/site-packages/pygptj/model.py", line 124, in generate
    pp.gptj_generate(self.gpt_params, self._model, self._vocab, self._call_new_text_callback)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte

Separately, I tried applying the update mentioned in the comment on the similar (but different) issue #61. It did successfully generate some text, but it failed again after just a few tokens.

Thanks @LeMoussel for reporting the issue.
I just tried it on Colab and you are right. I don't know why building it from source does not raise this error! Maybe it is something related to the pre-built wheels process!

I pushed a new version of pygptj, v1.0.8, to handle the error as @cbrendanprice mentioned, but it is still not working as expected.

import logging

from pygptj.model import Model


def new_text_callback(text):
    print(text, end="")


model = Model('/home/su/Downloads/ggml-gpt4all-j.bin', log_level=logging.ERROR)
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)

I have tried a lot but haven't yet figured out why the prebuilt wheels are not working properly!
Please let me know if you have any idea!

Same error for me (Ubuntu 22.04 running on WSL2), tried with both pygptj and pygpt4all.

Same problem on Ubuntu 20.04

model = GPT4All_J(model_path='./models/ggml-gpt4all-j-v1.3-groovy.bin', log_level=logging.ERROR)
model.generate(prompt, n_predict=50, new_text_callback=new_text_callback)

gptj_generate: seed = 1682513212
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'a'
gpt_tokenize: unknown token 'm'
gpt_tokenize: unknown token 'e'
gpt_tokenize: unknown token ','
gpt_tokenize: unknown token ' '
gptj_generate: number of tokens in prompt = 3

Hello guys,
Please update the package and give it another try.
I believe the issue should be solved.
(ensure you have pygptj version 1.0.10).
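One way to confirm which pygptj version is actually installed in the current environment is sketched below, using the standard library's `importlib.metadata` (Python 3.8+). The helpers `version_tuple` and `is_at_least` are hypothetical names for illustration, and the parsing assumes plain numeric dotted versions such as 1.0.10.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v):
    """Turn '1.0.10' into (1, 0, 10). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in v.split("."))


def is_at_least(installed, required="1.0.10"):
    """Compare versions numerically rather than as strings."""
    return version_tuple(installed) >= version_tuple(required)


try:
    installed = version("pygptj")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("pygptj is not installed in this environment")
elif is_at_least(installed):
    print(f"pygptj {installed} is new enough")
else:
    print(f"pygptj {installed} is older than 1.0.10; try: pip install --upgrade pygptj")
```

The comparison is done on integer tuples because comparing the raw strings would rank "1.0.8" above "1.0.10". On Colab, the runtime may need a restart after upgrading before the new version is picked up.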

It works for me. Thx!

Hi, it's OK with pygptj version 1.0.10.