dvmazur/mixtral-offloading

Mixtral Instruct tokenizer from Colab notebook doesn't work.

jmuntaner-smd opened this issue · 2 comments

When running the Google Colab notebook, loading the Mixtral Instruct tokenizer fails with the following error:

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
    109         elif fast_tokenizer_file is not None and not from_slow:
    110             # We have a serialization from tokenizers which let us directly build the backend
--> 111             fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    112         elif slow_tokenizer is not None:
    113             # We need to convert a slow tokenizer to build the backend

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

This appears to be a bug with the transformers and tokenizers versions (see: huggingface/transformers#31789), so the requirements.txt probably needs to be updated, but I haven't been able to fix it properly. As a workaround I changed the tokenizer to the base Mixtral model, but that's not the proper solution.
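For what it's worth, this kind of parse error usually means the installed `tokenizers` package is older than the one that serialized the model's `tokenizer.json`. A minimal sketch of that fix, assuming upgrading the packages in the Colab environment is acceptable (untested here):

```python
# Upgrade the packages first, e.g. in a Colab cell:
#   !pip install -U transformers tokenizers
# Then reload the Instruct tokenizer as the notebook does.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
```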

> I changed the tokenizer to the base Mixtral model, but it's not the proper solution.

What is the tokenizer version that you are using? I am also facing a similar issue.
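If it helps with comparing environments, the installed versions can be printed directly (both packages expose `__version__`):

```python
# Print the library versions in the current (e.g. Colab) environment.
import tokenizers
import transformers

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
```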

The issue seems to be due to recent commits in
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commits/main

I just changed the Google Colab line to this: `tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")`
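If the Instruct tokenizer is needed specifically, another option might be pinning it to a commit that predates the breaking change via the `revision` argument of `from_pretrained` (the revision string below is a placeholder, not a verified commit hash):

```python
from transformers import AutoTokenizer

# Pin the tokenizer files to an older revision of the model repo;
# replace the placeholder with a real commit hash from
# https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commits/main
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    revision="<commit-before-tokenizer-update>",
)
```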