noamgat/lm-format-enforcer

Non-ASCII characters break the output

remixer-dec opened this issue · 4 comments

Hi, I tried the llama.cpp example and noticed that this library cannot handle non-ASCII characters correctly by default; generation stops right after running into such a character.
Is it possible to allow these characters?
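To reproduce this without the full llama.cpp setup, here is a minimal probe of the character-level parser. This is my own sketch, not the shipped example, and it assumes the `add_character()` / `get_allowed_characters()` / `can_end()` API of `CharacterLevelParser`; it just reports where the parser stops accepting characters:

```python
# Hedged repro sketch: probe the character-level parser directly instead of
# going through llama.cpp. Assumes JsonSchemaParser exposes add_character(),
# get_allowed_characters() and can_end() from CharacterLevelParser.
from typing import List
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser

class AnswerFormat(BaseModel):
    emoji_strings: List[str]

parser = JsonSchemaParser(AnswerFormat.schema())
target = '{"emoji_strings": ["😊"]}'  # output we would like the enforcer to allow

for position, char in enumerate(target):
    if char not in parser.get_allowed_characters():
        print(f"parser rejects {char!r} at position {position}")
        break
    parser = parser.add_character(char)
else:
    print("full string accepted, can_end() =", parser.can_end())
```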

Hi! This is indeed a current limitation of the library; I hope to address it soon.

```python
from typing import List
from pydantic import BaseModel

class AnswerFormat(BaseModel):
    emoji_strings: List[str]

question = 'What are the 10 most common emojis? You MUST answer using the following json schema: '
```

now leads to

```json
  {
"emoji_strings": [
"😊",
"👍",
"🤣",
"🚀",
"🎉",
"❤️",
"🤔",
"🙏",
"💭",
"😢",
"😠"] }
```

Amazing!

I'm facing this issue with the transformers integration and with vLLM. The generate loop always exits after non-ASCII characters, because can_end() starts returning True.

I found that commenting out line 72 in integrations/transformers.py (cleaned = decoded.rstrip('�')) solves the issue. In other words, not stripping the decoded sequence helps, because the rstrip removes non-ASCII characters such as emojis and makes the parser think the generation is done.
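
To make the failure mode concrete, here is a small, library-independent sketch of one way trailing replacement characters can appear: when a multi-byte character such as an emoji has only been partially decoded, the string ends in '\ufffd', and rstrip('\ufffd') then erases exactly that progress, so the enforcer sees no new text and consults can_end() too early. This is plain Python, not the integration code:

```python
# Library-independent illustration of why stripping trailing U+FFFD
# ('\ufffd') can hide freshly generated non-ASCII characters.
emoji = "😊"
raw = emoji.encode("utf-8")                # 4 bytes: f0 9f 98 8a

# Pretend only the first two bytes of the emoji have been generated so far.
partial = raw[:2].decode("utf-8", errors="replace")
print(repr(partial))                       # ends in '\ufffd' (replacement character)

cleaned = partial.rstrip("\ufffd")
print(repr(cleaned))                       # '' -> the in-progress character vanishes,
                                           # so the enforcer sees no progress and may allow ending
```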