r2d4/parserllm

Error with WordLevel tokenizer

I tried running examples/example.py with a WordLevel tokenizer built from a dict[str, int]:

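from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# The WordLevel model is wrapped by Tokenizer, not the other way around.
tokenizer = Tokenizer(WordLevel(str_to_int_dict))
# Note: this assigns the token text, not an integer token id.
tokenizer.eos_token_id = '\n'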
<remaining example.py code>

Stack trace:

Traceback (most recent call last):
  File "<redacted>", line 177, in <module>
    '',
  File "/usr/local/lib/python3.10/dist-packages/parserllm/parserllm.py", line 43, in complete_cf
    terminal_regexes = extract_terminal_regex(parser, tokenizer.decode(tokenizer.eos_token_id))
  File "/usr/local/lib/python3.10/dist-packages/parserllm/parserllm.py", line 14, in extract_terminal_regex
    regex_map['$END'] = regex.compile(stop_token)
  File "<redacted>/.local/lib/python3.10/site-packages/regex/regex.py", line 353, in compile
    return _compile(pattern, flags, ignore_unused, kwargs, cache_pattern)
  File "/<redacted>/.local/lib/python3.10/site-packages/regex/regex.py", line 519, in _compile
    raise TypeError("first argument must be a string or compiled pattern")
TypeError: first argument must be a string or compiled pattern
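
The last frame suggests that whatever reached `regex.compile` as `stop_token` was not a `str`. Below is a minimal sketch of a guard that fails earlier with a clearer message. It assumes a plain `dict[str, int]` vocabulary and uses the standard library's `re` as a stand-in for `regex`. The names `vocab`, `decode_ids`, and `compile_stop_pattern` are hypothetical and are not part of parserllm or tokenizers.

```python
import re

# Hypothetical vocabulary standing in for str_to_int_dict.
vocab = {"hello": 0, "world": 1, "\n": 2}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def decode_ids(ids):
    """Stand-in decoder (assumption): map integer ids back to token text."""
    return "".join(id_to_token[i] for i in ids)


def compile_stop_pattern(stop_token):
    """Compile a literal stop token, rejecting non-string input up front."""
    if not isinstance(stop_token, str):
        raise TypeError(
            f"stop token must be a str, got {type(stop_token).__name__}"
        )
    # Escape the token so it is matched literally rather than as a pattern.
    return re.compile(re.escape(stop_token))


# Look up the integer id of the end token instead of storing its text.
eos_token_id = vocab["\n"]
stop_pattern = compile_stop_pattern(decode_ids([eos_token_id]))
```

Whether an integer `eos_token_id` is enough to make examples/example.py run with a WordLevel tokenizer is untested here. The sketch only shows the type check and the id lookup.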