Bug: is_pretokenized is not used when calling tokenizer.encode(...)
jannessm commented
is_pretokenized doesn't seem to be respected in some cases. The same code given below works in 0.20.0.
Code
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordPiece
m = WordPiece({'F': 0, '<eos>': 1})
t = Tokenizer(m)
t.pre_tokenizer = pre_tokenizers.Split('', 'isolated')
t.encode(['<eos>'], is_pretokenized=True).ids
Expected to run without any issue, but it raises the exception:
Exception: WordPiece error: Missing [UNK] token from the vocabulary
It seems to ignore the is_pretokenized flag and wants to apply the pre_tokenizer to the <eos> token.
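For comparison, a minimal diagnostic sketch (my own addition, not part of the original report): with no pre_tokenizer assigned, the same encode call should presumably return [1], the id of <eos>, which would confirm that the pre_tokenizer is being applied to the already-split input despite is_pretokenized=True.

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Same vocabulary as in the repro above, but no pre_tokenizer is assigned.
m = WordPiece({'F': 0, '<eos>': 1})
t = Tokenizer(m)
# '<eos>' is looked up directly in the vocabulary, so this is expected
# to print [1] rather than raise the "Missing [UNK] token" error.
print(t.encode(['<eos>'], is_pretokenized=True).ids)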