huggingface/transformers

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

murthyrudra opened this issue · 5 comments

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 133, in __init__
    super().__init__(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
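For reference, the "line 40 column 3" in the message points into the downloaded tokenizer.json, and the offending span can be inspected directly. A minimal sketch (assuming network access to the Hub and access to the, possibly gated, repo; not part of the original report):

import json
from huggingface_hub import hf_hub_download

# Download the raw tokenizer.json that TokenizerFast.from_file tries to deserialize.
path = hf_hub_download("mistralai/Mixtral-8x7B-Instruct-v0.1", "tokenizer.json")
with open(path, encoding="utf-8") as f:
    config = json.load(f)

# The "untagged enum PyPreTokenizerTypeWrapper" error typically means the
# pre_tokenizer entry uses a variant the installed tokenizers build cannot parse.
print(json.dumps(config.get("pre_tokenizer"), indent=2))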


Expected behavior

The tokenizer should load without any error.

Hello! Is it possible you have an outdated version of tokenizers? Do you mind upgrading to the latest one?

pip install -U tokenizers
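A quick way to confirm which versions the environment actually picks up, before and after the upgrade (a minimal sketch using only the standard library):

import importlib.metadata

# Print the installed versions that transformers will use at import time.
print("tokenizers:", importlib.metadata.version("tokenizers"))
print("transformers:", importlib.metadata.version("transformers"))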

Thanks, updating the tokenizers library helped :)

I have this issue too, and a partial update does not work:

!pip install -U 'tokenizers<0.15'
# Successfully installed tokenizers-0.14.1

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

!pip install -U 'tokenizers'
# Successfully installed tokenizers-0.19.1
# RESTART NOTEBOOK

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# ImportError: tokenizers>=0.14,<0.15 is required for a normal functioning of this module, but found tokenizers==0.19.1.
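The ImportError comes from the tokenizers range that the installed transformers release pins; that range can be read from the package metadata (a minimal sketch, assuming a standard pip install):

import importlib.metadata

# Show the tokenizers constraint declared by the installed transformers release.
for req in importlib.metadata.requires("transformers") or []:
    if req.startswith("tokenizers"):
        print(req)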

A full update solves the issue for me:

!pip install -U transformers
#Successfully installed huggingface-hub-0.23.4 transformers-4.42.3

It also works when run from the xonsh shell.
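After the full upgrade, the versions and the load can be double-checked in the same session (a minimal sketch; the Mistral repo may be gated and require logging in first):

import tokenizers
import transformers
from transformers import AutoTokenizer

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)

# With matching versions the fast tokenizer should deserialize without the enum error.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tokenizer("hello world"))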

I have the same issue, and the package versions are the latest: transformers==4.43.3, tokenizers==0.19.1.

@littlerookie sorry, but I can't reproduce this:
[screenshot]
Could you share a bit more?