Mixtral Instruct tokenizer from Colab notebook doesn't work.
jmuntaner-smd opened this issue · 2 comments
When running the Google Colab notebook, there seems to be an error when loading the Mixtral Instruct tokenizer:
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
109 elif fast_tokenizer_file is not None and not from_slow:
110 # We have a serialization from tokenizers which let us directly build the backend
--> 111 fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
112 elif slow_tokenizer is not None:
113 # We need to convert a slow tokenizer to build the backend
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
This appears to be a bug with the transformers and tokenizers versions (see: huggingface/transformers#31789), so requirements.txt probably needs to be updated, but I haven't been able to fix it properly. I changed the tokenizer to the base Mixtral model, but that's not a proper solution.
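If the version mismatch described in the linked issue is the root cause, one candidate fix is upgrading `transformers` and `tokenizers` so the backend can parse the updated `tokenizer.json`. A minimal sketch, assuming an upgrade is enough (the upgrade itself is an assumption, not a confirmed fix):

```python
# Sketch of a possible workaround, assuming the error comes from an
# outdated `tokenizers` build that cannot parse the new tokenizer.json.
# Upgrade first (illustrative, not a confirmed fix):
#   pip install --upgrade transformers tokenizers
from transformers import AutoTokenizer

# Retry loading the Instruct tokenizer with the upgraded packages.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
```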
> I changed the tokenizer to the base Mixtral model, but it's not the proper solution.

Which tokenizers version are you using? I am facing a similar issue.
The issue seems to be caused by recent commits in https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commits/main
I just changed the Google Colab line to this: `tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")`
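An alternative sketch that keeps the Instruct tokenizer would be to pin `from_pretrained` to a repo revision from before the breaking commits; `revision` is a standard `from_pretrained` argument, but the hash below is a placeholder to be taken from the commit history linked above:

```python
from transformers import AutoTokenizer

# Pin the Instruct repo to a commit that predates the tokenizer change.
# The revision value is a hypothetical placeholder; substitute a real
# hash from the commit history linked above.
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    revision="<commit-before-tokenizer-change>",  # placeholder
)
```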