nomic-ai/pygpt4all

An error regarding UnicodeDecodeError

funnygeeker opened this issue · 7 comments

Hello, I am using an Alpaca model that supports Chinese, but I often encounter the following error when using PyLLaMACpp:

Traceback (most recent call last):
  File "C:\Users\xxxxxxx\PycharmProjects\pyllamacpp\main.py", line 10, in <module>
    model.generate("你好", n_predict=64, new_text_callback=new_text_callback, n_threads=8, verbose=True)
  File "C:\Users\xxxxxxx\PycharmProjects\pyllamacpp\venv\lib\site-packages\pyllamacpp\model.py", line 112, in generate
    pp.llama_generate(self._ctx, self.gpt_params, self._call_new_text_callback, verbose)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: unexpected end of data

This model generally runs fine in other similar programs, but with PyLLaMACpp it always errors partway through generating a reply.

Model source: https://huggingface.co/P01son/ChatLLaMA-zh-7B-int4
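
For context, this error pattern is consistent with a multi-byte UTF-8 character being split across tokens, with each token decoded on its own. A minimal illustration (the character 市 is an arbitrary example, not taken from the model output):

# 市 (U+5E02) is the three bytes 0xE5 0xB8 0x82 in UTF-8; decoding only the
# first byte raises the same kind of error as in the tracebacks here
b"\xe5\xb8\x82"[:1].decode("utf-8")
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: unexpected end of data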

Additional console output:

llama_model_load: loading model from 'C:/Users/xxxxxxx/Desktop/ai/llama/chatllama-ggml-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from 'C:/Users/xxxxxxx/Desktop/ai/chatllama-ggml-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
 从前,llama_generate: seed = 1681733560

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 64, n_keep = 0


有一个Traceback (most recent call last):
  File "C:\Users\yangc\PycharmProjects\pyllamacpp\main.py", line 10, in <module>
    model.generate("从前,", n_predict=64, new_text_callback=new_text_callback, n_threads=8, verbose=True)
  File "C:\Users\yangc\PycharmProjects\pyllamacpp\venv\lib\site-packages\pyllamacpp\model.py", line 112, in generate
    pp.llama_generate(self._ctx, self.gpt_params, self._call_new_text_callback, verbose)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: unexpected end of data

The process has ended with exit code 1.

I can confirm the problem on Windows and Linux with different models. To reproduce it, you can also ask the model to translate a word into several languages that use multi-byte Unicode encodings:
model.generate('translation "market" to korean, chinese, arabic, spanish languages is:\n', n_predict=1024, new_text_callback=new_text_callback, n_threads=8)
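
For completeness, a minimal self-contained version of that repro (a sketch only: the model path is a placeholder, and the callback simply echoes the generated text, as in the tracebacks above):

from pyllamacpp.model import Model

def new_text_callback(text: str):
    print(text, end="", flush=True)

# placeholder path: any ggml LLaMA model whose reply contains multi-byte
# UTF-8 characters should trigger the error
model = Model(ggml_model='path/to/ggml-model-q4_0.bin', n_ctx=512)
model.generate('translation "market" to korean, chinese, arabic, spanish languages is:\n',
               n_predict=1024, new_text_callback=new_text_callback, n_threads=8)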

The fix for the same issue in other projects looks like this:

import struct

def read_tokens(fin, vocab_size):
    # Read vocab_size length-prefixed token strings and their float scores,
    # replacing invalid UTF-8 sequences instead of raising.
    tokens = []
    for _ in range(vocab_size):
        text_len = struct.unpack("i", fin.read(4))[0]
        text_bytes = fin.read(text_len)
        try:
            text = text_bytes.decode("utf-8")
        except UnicodeDecodeError:
            text = text_bytes.decode("utf-8", "replace")
        score = struct.unpack("f", fin.read(4))[0]
        tokens.append((text, score))
    return tokens

But how can this be applied to the pp.llama_generate function?

I managed to use the approach @Fikavec suggested; it seems to work with the original input that caused the UnicodeDecodeError.
Not sure if this is the best way to do it, but here is what I had to do, following
https://pybind11.readthedocs.io/en/latest/advanced/cast/strings.html#returning-c-strings-to-python

1. Edited src/main.cpp so that it feeds new_text_callback() a py::bytes object, instead of implicitly converting the llama_token_to_str return value to py::str (which is probably what triggers the UnicodeDecodeError):
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -447,7 +447,9 @@ int llama_generate(struct llama_context_wrapper * ctx_w, gpt_params params, py::
         if (!input_noecho) {
             for (auto id : embd) {
 //                printf("%s", llama_token_to_str(ctx, id));
-                new_text_callback(llama_token_to_str(ctx, id));
+                std::string res = llama_token_to_str(ctx, id);
+                py::bytes py_res = py::bytes(res);
+                new_text_callback(py_res);
             }
             fflush(stdout);
         }
2. Modified the signature of new_text_callback to be Callable[[bytes], None]:
--- a/pyllamacpp/model.py
+++ b/pyllamacpp/model.py
@@ -20,6 +20,8 @@ import logging
 import sys
 import _pyllamacpp as pp
 
+import pdb
+
 
 class Model:
     """
@@ -62,7 +64,7 @@ class Model:
         # gpt params
         self.gpt_params = pp.gpt_params()
 
-        self.res = ""
+        self.res = b""
 
     @staticmethod
     def _set_params(params, kwargs: dict) -> None:
@@ -86,7 +88,7 @@ class Model:
 
     def generate(self, prompt: str,
                  n_predict: int = 128,
-                 new_text_callback: Callable[[str], None] = None,
+                 new_text_callback: Callable[[bytes], None] = None,
                  verbose: bool = False,
                  **gpt_params) -> str:
         """
@@ -105,7 +107,7 @@ class Model:
         self._set_params(self.gpt_params, gpt_params)
 
         # assign new_text_callback
-        self.res = ""
+        self.res = b""
         Model._new_text_callback = new_text_callback
3. Now we can handle a UnicodeDecodeError ourselves inside the callback:
from pyllamacpp.model import Model
import pdb
import sys

def new_text_callback(text: bytes):
    #pdb.set_trace()
    new_text = ""
    try:
        new_text = text.decode("utf-8")
    except UnicodeDecodeError:
        # fall back to U+FFFD replacement characters for partial/invalid bytes
        new_text = text.decode("utf-8", "replace")
    print(new_text, end="", flush=True)

model = Model(ggml_model='/home/ubuntu/models/gpt4all-lora-quantized-ggjt.bin', n_ctx=512)
model.generate("从前,", n_predict=64, new_text_callback=new_text_callback, n_threads=4, verbose=True)
(venv) ubuntu@llama:~/pyllamacpp$ python test.py 
llama_model_load: loading model from '/home/ubuntu/models/gpt4all-lora-quantized-ggjt.bin' - please wait ...
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from '/home/ubuntu/models/gpt4all-lora-quantized-ggjt.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
llama_generate: seed = 1681764454

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 64, n_keep = 0


 从前,我们认为������的是真正美好,因为���能力了解其���人的需求。然而,现在我们已经发现,这个������还有���密,那就
llama_print_timings:        load time =  3022.36 ms
llama_print_timings:      sample time =    53.13 ms /    64 runs   (    0.83 ms per run)
llama_print_timings: prompt eval time =  1451.87 ms /     5 tokens (  290.37 ms per token)
llama_print_timings:        eval time = 21015.89 ms /    63 runs   (  333.59 ms per run)
llama_print_timings:       total time = 24094.05 ms
(venv) ubuntu@llama:~/pyllamacpp$ 
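
A possible refinement (an untested sketch, not part of the changes above): instead of replacing undecodable bytes with U+FFFD, an incremental UTF-8 decoder can buffer the bytes of a character that was split across callbacks and print it once the remaining bytes arrive:

import codecs

# the incremental decoder keeps incomplete multi-byte sequences in its buffer
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

def new_text_callback(text: bytes):
    chunk = decoder.decode(text)
    if chunk:
        print(chunk, end="", flush=True)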

Thanks @r0psteev, it worked like a charm!

Great idea @r0psteev.
Can you submit a PR here so I can merge your changes?
Thank you!

Yes, sure @abdeladim-s.

This should be working now thanks to the amazing contribution of @r0psteev. Please feel free to reopen it if you still have any issues.