Mistral's tokenizer is not optimal: encode() does not return the shortest tokenization
Yarflam commented
Hello!
How to reproduce:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
tokenizer.add_bos_token = False  # disable BOS/EOS so we compare raw ids only
tokenizer.add_eos_token = False
ids = [12866, 601]  # "▁domestic" + "ated"
text = tokenizer.decode(ids)  # -> "domesticated"
reencoded = tokenizer.encode(text)
print(reencoded)
# output -> [2853, 374, 6899]
# "▁dom" + "est" + "icated": three tokens where a two-token segmentation exists
I don't know what the best fix is, or whether this case has an impact on model computation.
It's just feedback, but I'm sure it's possible to find other cases (see the sketch below).
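For example, here is a minimal brute-force sketch that searches for such cases by decoding candidate token pairs and flagging those whose re-encoding differs (the helper name find_unstable_pairs is illustrative, not from the library; it reuses the tokenizer object built above):

def find_unstable_pairs(tokenizer, first_ids, second_ids):
    # Decode each candidate pair, re-encode the resulting text, and
    # keep the pairs whose ids change after the round trip.
    unstable = []
    for a in first_ids:
        for b in second_ids:
            ids = [a, b]
            text = tokenizer.decode(ids)
            if tokenizer.encode(text, add_special_tokens=False) != ids:
                unstable.append((ids, text))
    return unstable

# Re-check the pair reported above:
print(find_unstable_pairs(tokenizer, [12866], [601]))
# expected -> [([12866, 601], 'domesticated')]

Sweeping both arguments over a larger id range should surface more non-round-trip pairs like this one.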