On Colab, UnicodeDecodeError

Question

LeMoussel opened this issue a year ago · 9 comments

Hello, I am using pygpt4all on Colab, but I often encounter the following error:

---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-2-1545ea4aa0da> in <cell line: 8>()
      6 
      7 model = GPT4All_J('./ggml-gpt4all-j-v1.2-jazzy.bin')
----> 8 model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)

/usr/local/lib/python3.9/dist-packages/pygptj/model.py in generate(self, prompt, new_text_callback, n_predict, seed, n_threads, top_k, top_p, temp, n_batch)
    122 
    123         # run the prediction
--> 124         pp.gptj_generate(self.gpt_params, self._model, self._vocab, self._call_new_text_callback)
    125         return self.res
    126 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte

Colab source code:

!pip install -q pygpt4all 
!wget https://gpt4all.io/models/ggml-gpt4all-j-v1.2-jazzy.bin

# https://github.com/nomic-ai/pygpt4all
from pygpt4all.models.gpt4all_j import GPT4All_J

def new_text_callback(text):
    print(text, end="", flush=True)

model = GPT4All_J('./ggml-gpt4all-j-v1.2-jazzy.bin')
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)
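
For context on the message in the traceback above, here is a minimal sketch in plain Python. Byte 0xad has the bit pattern 10101101, which UTF-8 reserves for continuation bytes, so it can never begin a character. The sketch reproduces the same message and shows two standard-library ways to decode byte chunks that do not line up with character boundaries. The `pieces` list is a hypothetical stand-in for text arriving in chunks; nothing here uses pygptj, and it is only an assumption that a similar split plays a role in the error reported in this thread.

```python
import codecs

# Decoding a lone continuation byte reproduces the message from the traceback.
try:
    b"\xad".decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# Hypothetical chunks: "é" is encoded as two bytes (0xc3 0xa9), split here.
pieces = [b"caf", b"\xc3", b"\xa9", b"!"]

# An incremental decoder buffers an incomplete sequence until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
text = "".join(decoder.decode(piece) for piece in pieces)
print(text)

# errors="replace" never raises; undecodable bytes become U+FFFD.
replaced = b"\xad".decode("utf-8", errors="replace")
```

The incremental decoder keeps the text intact when the chunks are valid UTF-8 overall, while `errors="replace"` only hides the symptom if the bytes themselves are wrong.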

Answer 1 · 2023-04-24T13:07:00.000Z

I am getting this issue as well on an Ubuntu 22.04 VirtualBox VM. I am just trying to run the code example in the README file.

Answer 2 · 2023-04-24T15:16:16.000Z

Also getting this on Ubuntu 20.04 (WSL2) using Python 3.11.2. Tried using a fresh venv and installing the versions below, but still the same problem.
pygpt4all 1.0.1
pygptj 1.0.5

Answer 3 · 2023-04-24T19:06:48.000Z

I'm experiencing this as well on Ubuntu 22.04, Python 3.10.9.

I too was just trying to run the code example from the README. I also got some additional errors about unknown tokens (see the additional output below):

---------------------------------------------------------------------------
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)
gptj_generate: seed = 1682362796
gptj_generate: number of tokens in prompt = 3
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'a'
gpt_tokenize: unknown token 'm'
gpt_tokenize: unknown token 'e'
gpt_tokenize: unknown token ','
gpt_tokenize: unknown token ' '
Once upon tiTraceback (most recent call last):
  File "/opt/python/mambaforge/envs/cbot4me/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3505, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-0c9a6b32721a>", line 1, in <module>
    model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)
  File "/opt/python/mambaforge/envs/cbot4me/lib/python3.10/site-packages/pygptj/model.py", line 124, in generate
    pp.gptj_generate(self.gpt_params, self._model, self._vocab, self._call_new_text_callback)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte

Separately, I did try applying the update mentioned in a comment on the similar (but different) issue #61. It did successfully generate some text, but it failed again after generating just a few tokens.

Answer 4 · 2023-04-25T07:13:32.000Z

Thanks @LeMoussel for reporting the issue.
I just tried it on Colab and you are right. I don't know why building it from source does not raise this error! Maybe it is something related to the pre-built wheels process!

I pushed a new version, pygptj v1.0.8, to handle the error as @cbrendanprice mentioned, but it is still not working as expected.

import logging  # needed for logging.ERROR below

from pygptj.model import Model


def new_text_callback(text):
    print(text, end="")


model = Model('/home/su/Downloads/ggml-gpt4all-j.bin', log_level=logging.ERROR)
model.generate("Once upon a time, ", n_predict=55, new_text_callback=new_text_callback)

I have tried a lot but haven't figured out yet why the prebuilt wheels are not working properly!
Please let me know if you have any idea!

Answer 5 · 2023-04-25T19:25:25.000Z

Same error for me (Ubuntu 22.04 running on WSL2), tried with both pygptj and pygpt4all.

Answer 6 · 2023-04-26T12:55:10.000Z

Same problem on Ubuntu 20.04

model = GPT4All_J(model_path='./models/ggml-gpt4all-j-v1.3-groovy.bin', log_level=logging.ERROR)
model.generate(prompt, n_predict=50, new_text_callback=new_text_callback)

gptj_generate: seed = 1682513212
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'a'
gpt_tokenize: unknown token 'm'
gpt_tokenize: unknown token 'e'
gpt_tokenize: unknown token ','
gpt_tokenize: unknown token ' '
gptj_generate: number of tokens in prompt = 3

Answer 7 · 2023-04-29T00:43:40.000Z

Hello guys,
Please update the package and give it another try.
I believe the issue should be solved.
(ensure you have pygptj version 1.0.10).
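
One way to confirm which version is installed is sketched below, using only the standard library. The helper names `version_tuple` and `has_fix` are made up for this illustration, and the sketch assumes plain numeric dotted versions such as 1.0.10 (a suffix like `.post1` would need extra handling).

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = "1.0.10"  # version named in this thread as containing the fix


def version_tuple(text):
    """Turn '1.0.10' into (1, 0, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, minimum=MINIMUM):
    # Plain string comparison would wrongly rank "1.0.5" above "1.0.10".
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = version("pygptj")
except PackageNotFoundError:
    print("pygptj is not installed")
else:
    status = "ok" if has_fix(installed) else "too old, run: pip install --upgrade pygptj"
    print(f"pygptj {installed}: {status}")
```

On Colab the upgrade command needs a leading `!`, and the runtime usually has to be restarted before the new version is picked up.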

Answer 8 · 2023-04-30T19:25:32.000Z

It works for me. Thx!

Answer 9 · 2023-05-02T07:20:53.000Z

Hi, it's OK with pygptj version 1.0.10.