huggingface/transformers

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

murthyrudra opened this issue · 5 comments

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 133, in __init__
    super().__init__(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
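For reference, the "line 40 column 3" in the message points into the downloaded tokenizer.json, and the offending span can be inspected directly. A minimal sketch (assuming network access to the Hub and access to the, possibly gated, repo; not part of the original report):

import json
from huggingface_hub import hf_hub_download

# Download the raw tokenizer.json that TokenizerFast.from_file tries to deserialize.
path = hf_hub_download("mistralai/Mixtral-8x7B-Instruct-v0.1", "tokenizer.json")
with open(path, encoding="utf-8") as f:
    config = json.load(f)

# The "untagged enum PyPreTokenizerTypeWrapper" error typically means the
# pre_tokenizer entry uses a variant the installed tokenizers build cannot parse.
print(json.dumps(config.get("pre_tokenizer"), indent=2))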


Expected behavior

The tokenizer should load without any error.

Hello! Is it possible you have an outdated version of tokenizers? Do you mind upgrading to the latest one?

pip install -U tokenizers
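A quick way to confirm which versions the environment actually picks up, before and after the upgrade (a minimal sketch using only the standard library):

import importlib.metadata

# Print the installed versions that transformers will use at import time.
print("tokenizers:", importlib.metadata.version("tokenizers"))
print("transformers:", importlib.metadata.version("transformers"))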

Thanks, updating the tokenizers library helped :)

I have this issue too, and a partial update does not work:

!pip install -U 'tokenizers<0.15'
# Successfully installed tokenizers-0.14.1

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

!pip install -U 'tokenizers'
# Successfully installed tokenizers-0.19.1
# RESTART NOTEBOOK

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# ImportError: tokenizers>=0.14,<0.15 is required for a normal functioning of this module, but found tokenizers==0.19.1.
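The ImportError comes from the tokenizers range that the installed transformers release pins; that range can be read from the package metadata (a minimal sketch, assuming a standard pip install):

import importlib.metadata

# Show the tokenizers constraint declared by the installed transformers release.
for req in importlib.metadata.requires("transformers") or []:
    if req.startswith("tokenizers"):
        print(req)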

A full update solves the issue for me:

!pip install -U transformers
#Successfully installed huggingface-hub-0.23.4 transformers-4.42.3

It also works when run from the xonsh shell.
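After the full upgrade, the versions and the load can be double-checked in the same session (a minimal sketch; the Mistral repo may be gated and require logging in first):

import tokenizers
import transformers
from transformers import AutoTokenizer

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)

# With matching versions the fast tokenizer should deserialize without the enum error.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tokenizer("hello world"))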

I have the same issue, and the package versions are the latest: transformers==4.43.3, tokenizers==0.19.1.

@littlerookie sorry, but I can't reproduce this:
[screenshot]
Could you share a bit more?