IntrinsicLabsAI/gbnfgen

Exclude unicode characters from string?

Closed this issue · 2 comments

Hi, I can't find any documentation but would like to exclude unicode characters from strings.

Specifically, the control characters 0x00-0x08 and 0x0B-0x1F, so that I don't get crap in the output. Is there a way in GBNF to exclude these? It would be useful to have this in the string set.

a10y commented

Hey @JohnGalt1717,

I'm not sure we have strong opinions about this on our end. If you'd like to put together a sketch for a PR of what you're thinking, I'm happy to review it.

a10y commented

Hey @JohnGalt1717, after some poking, it looks like this is something GBNF supports: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md#characters-and-character-ranges
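For illustration, here is a rough sketch of how a negated GBNF character class for the ranges in the original question could be assembled. It assumes the `\xXX` escapes and `[^...]` negation described in the linked README; the helper `gbnf_excluded_class` and the rule name `clean-string` are made up for this example and are not part of gbnfgen.

```python
# Sketch only: builds the text of a GBNF rule; nothing here calls gbnfgen or llama.cpp.

def gbnf_excluded_class(ranges):
    """Return a negated GBNF character class, e.g. [^\\x00-\\x08], from (lo, hi) pairs."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(f"\\x{lo:02X}")
        else:
            parts.append(f"\\x{lo:02X}-\\x{hi:02X}")
    return "[^" + "".join(parts) + "]"

# The ranges from the question: everything below 0x20 except tab (0x09) and newline (0x0A).
CONTROL_RANGES = [(0x00, 0x08), (0x0B, 0x1F)]

char_class = gbnf_excluded_class(CONTROL_RANGES)
rule = f"clean-string ::= {char_class}*"  # hypothetical rule name
print(rule)
```

The resulting rule text would still have to be spliced into a grammar by hand, since gbnfgen derives its grammar from TypeScript types.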

However, to my knowledge there isn't anything in the TS type system that allows you to declare a string as only containing Latin1 characters. Given that the input to this library is a TypeScript type definition, I don't think this would be something we could support on our end.

One alternative could be to use the outlines library. You can write a pydantic model that uses constr to do this, e.g.:

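from pydantic import BaseModel, constr

class ASCIIModel(BaseModel):
    # Restrict the field to ASCII (0x00-0x7F).
    # `regex=` is the pydantic v1 keyword; pydantic v2 renamed it to `pattern=`.
    ascii_string: constr(regex=r'^[\x00-\x7F]+$')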
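
Note that the ASCII pattern above still admits the control characters 0x00-0x1F. A pattern closer to the ranges in the original question might look like the following sketch, which uses only the standard library `re` module; the names `NO_CONTROL_CHARS` and `is_clean` are hypothetical and chosen for this example.

```python
import re

# One or more characters, none of which fall in 0x00-0x08 or 0x0B-0x1F.
# Tab (0x09), newline (0x0A), and non-ASCII characters are all allowed.
NO_CONTROL_CHARS = re.compile(r'[^\x00-\x08\x0B-\x1F]+')

def is_clean(text: str) -> bool:
    """Return True if the whole string is non-empty and free of the excluded control characters."""
    return NO_CONTROL_CHARS.fullmatch(text) is not None
```

A check like this validates output after generation, whereas a grammar constrains the model while it is generating.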