noamgat/lm-format-enforcer

Non-ASCII characters break the output

remixer-dec opened this issue · 4 comments

Hi, I tried the llama.cpp example and noticed that this library cannot handle non-ASCII characters correctly by default; generation stops right after running into such a character.
Is it possible to allow these characters?
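To reproduce this without the full llama.cpp setup, here is a minimal probe of the character-level parser. This is my own sketch, not the shipped example, and it assumes the `add_character()` / `get_allowed_characters()` / `can_end()` API of `CharacterLevelParser`; it just reports where the parser stops accepting characters:

```python
# Hedged repro sketch: probe the character-level parser directly instead of
# going through llama.cpp. Assumes JsonSchemaParser exposes add_character(),
# get_allowed_characters() and can_end() from CharacterLevelParser.
from typing import List
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser

class AnswerFormat(BaseModel):
    emoji_strings: List[str]

parser = JsonSchemaParser(AnswerFormat.schema())
target = '{"emoji_strings": ["😊"]}'  # output we would like the enforcer to allow

for position, char in enumerate(target):
    if char not in parser.get_allowed_characters():
        print(f"parser rejects {char!r} at position {position}")
        break
    parser = parser.add_character(char)
else:
    print("full string accepted, can_end() =", parser.can_end())
```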

Hi! This is indeed a current limitation of the library; I hope to address it soon.

```python
from typing import List
from pydantic import BaseModel

class AnswerFormat(BaseModel):
    emoji_strings: List[str]

question = 'What are the 10 most common emojis? You MUST answer using the following json schema: '
```

now leads to

```json
  {
"emoji_strings": [
"😊",
"👍",
"🤣",
"🚀",
"🎉",
"❤️",
"🤔",
"🙏",
"💭",
"😢",
"😠"] }
```

Amazing!

I'm facing this issue with the transformers integration and with vLLM. The generate loop always exits after non-ASCII characters, because can_end() starts returning True.

I found that commenting out line 72 in integrations/transformers.py (cleaned = decoded.rstrip('�')) solves the issue. In other words, not stripping the decoded sequence helps, because the rstrip removes non-ASCII characters such as emojis and makes the parser think the generation is done.
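
To make the failure mode concrete, here is a small, library-independent sketch of one way trailing replacement characters can appear: when a multi-byte character such as an emoji has only been partially decoded, the string ends in '\ufffd', and rstrip('\ufffd') then erases exactly that progress, so the enforcer sees no new text and consults can_end() too early. This is plain Python, not the integration code:

```python
# Library-independent illustration of why stripping trailing U+FFFD
# ('\ufffd') can hide freshly generated non-ASCII characters.
emoji = "😊"
raw = emoji.encode("utf-8")                # 4 bytes: f0 9f 98 8a

# Pretend only the first two bytes of the emoji have been generated so far.
partial = raw[:2].decode("utf-8", errors="replace")
print(repr(partial))                       # ends in '\ufffd' (replacement character)

cleaned = partial.rstrip("\ufffd")
print(repr(cleaned))                       # '' -> the in-progress character vanishes,
                                           # so the enforcer sees no progress and may allow ending
```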