Non-ASCII characters break the output
remixer-dec opened this issue · 4 comments
remixer-dec commented
noamgat commented
Hi! This is indeed a current limitation of the library. I hope to address it soon.
noamgat commented
from typing import List
from pydantic import BaseModel

class AnswerFormat(BaseModel):
    emoji_strings: List[str]
question = 'What are the 10 most common emojis? You MUST answer using the following json schema: '
now leads to
{
"emoji_strings": [
"๐",
"๐",
"🤣",
"๐",
"๐",
"❤️",
"๐ค",
"๐",
"๐ญ",
"๐ข",
"๐ "] }
remixer-dec commented
Amazing!
jorge-tromero commented
I'm facing this issue with both the transformers integration and vLLM. The generate loop always exits after non-ASCII characters, because can_end() starts returning True.
I found that commenting out line 72 in integrations/transformers.py solves the issue (cleaned = decoded.rstrip('๏ฟฝ')
). That is, not cleaning the decoded sequence, as it removes non-ascii characters such as emojis and makes the parser think the generation is done.
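For anyone trying to reproduce this, here is a minimal sketch of how stripping the replacement character can hide a half-generated emoji. It assumes that token-by-token decoding can land in the middle of a multi-byte UTF-8 sequence. The helper decode_partial and the sample byte strings are hypothetical illustrations, not the library's actual code.

```python
# Hypothetical illustration; this is NOT the code from integrations/transformers.py.
REPLACEMENT = "\ufffd"  # U+FFFD, emitted by the decoder for undecodable bytes


def decode_partial(data: bytes) -> str:
    """Decode bytes leniently, the way a mid-character decode might behave."""
    return data.decode("utf-8", errors="replace")


emoji_bytes = "\U0001F602".encode("utf-8")  # a 4-byte emoji
prefix = b'["'                              # start of a JSON string array

# Only the first two bytes of the emoji have been generated so far.
partial = decode_partial(prefix + emoji_bytes[:2])
# All four bytes have been generated.
complete = decode_partial(prefix + emoji_bytes)

# Stripping trailing U+FFFD discards the in-progress character entirely,
# so the text looks as if nothing was produced after the opening quote.
cleaned_partial = partial.rstrip(REPLACEMENT)
cleaned_complete = complete.rstrip(REPLACEMENT)

print(repr(partial))
print(repr(cleaned_partial))
print(repr(cleaned_complete))
```

In this sketch the cleaned partial text is identical to the text before the emoji started, so a parser that only sees the cleaned string has no way to tell that a character is still in progress.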