huggingface/tokenizers

Bug with `CodeQwen1.5`: `data did not match any variant of untagged enum PyPreTokenizerTypeWrapper`

QwertyJack opened this issue · 1 comments

When I load CodeQwen1.5-7B-Chat it complains:

  File "/home/ubuntu/conda/envs/vllm-0.4.2/lib/python3.11/site-packages/vllm/transformers_utils/tokenizer_group/tokenizer_group.py", line 23, in __init__
    self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/conda/envs/vllm-0.4.2/lib/python3.11/site-packages/vllm/transformers_utils/tokenizer.py", line 92, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/conda/envs/vllm-0.4.2/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/conda/envs/vllm-0.4.2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/conda/envs/vllm-0.4.2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/conda/envs/vllm-0.4.2/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 12564 column 3

However, it works fine if downgraded to a previous version, e.g. tokenizers==0.15.2.

Yep I think there was a problem in the conversion, probably used the main at some point