BpeTrainer seems to ignore max_token_length=1
geajack opened this issue · 2 comments
geajack commented
In the following script, the resulting vocabulary contains tokens of length > 1.
from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

# With max_token_length=1, no merged (multi-character) token should survive training
trainer = BpeTrainer(max_token_length=1)
tokenizer_spec = Tokenizer(BPE())
tokenizer_spec.train_from_iterator(["hello world"], trainer=trainer)
vocab = tokenizer_spec.get_vocab()
print(vocab)
What I'd expect instead is a vocabulary consisting only of the individual characters in the corpus.
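For reference, a minimal sketch of the character-only vocabulary I would expect (assuming no pre-tokenizer is set, so the space character is part of the initial alphabet):

corpus = "hello world"
# The initial BPE alphabet: every distinct character in the corpus
expected = sorted(set(corpus))
print(expected)  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']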
ArthurZucker commented
I can indeed reproduce this. It works for any other value:
from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

trainer = BpeTrainer(max_token_length=64)

tokenizer_spec = Tokenizer(BPE())
tokenizer_spec.train_from_iterator(["hello world, orl lorld, corld forld"], trainer=trainer)
vocab = tokenizer_spec.get_vocab()
print(vocab)
But I don't think you can go lower than 2.
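In the meantime, a possible workaround sketch (an assumption on my part, not a confirmed fix): capping vocab_size at the alphabet size should stop training before any merges are learned, since merged tokens are only added once the initial alphabet is in the vocabulary.

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

corpus = ["hello world"]
# One vocabulary slot per distinct character leaves no room for merged tokens
trainer = BpeTrainer(vocab_size=len(set("".join(corpus))))
tokenizer_spec = Tokenizer(BPE())
tokenizer_spec.train_from_iterator(corpus, trainer=trainer)
print(tokenizer_spec.get_vocab())  # expected: single characters only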
Most probably an issue with tokenizers/tokenizers/src/models/bpe/word.rs, line 106 (commit d3e8008).
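To illustrate the invariant at stake, here is a hedged Python sketch (illustrative only, not the actual Rust code in word.rs): a merged token's length is the sum of its two parts, so it is always at least 2 and should never pass a max_token_length of 1.

# Illustrative only: the kind of length guard the merge loop needs
def allow_merge(left_len, right_len, max_length=None):
    new_len = left_len + right_len  # always >= 2 for a merge
    return max_length is None or new_len <= max_length

print(allow_merge(1, 1, max_length=1))  # False: no merge should be kept
print(allow_merge(1, 1, max_length=2))  # True: character pairs allowed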