huggingface/tokenizers

BpeTrainer seems to ignore max_token_length=1

geajack opened this issue · 2 comments

In the following script, the resulting vocabulary contains tokens of length > 1, even though max_token_length=1 is set.

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

trainer = BpeTrainer(max_token_length=1)  # no token should exceed one character

tokenizer_spec = Tokenizer(BPE())
tokenizer_spec.train_from_iterator(["hello world"], trainer=trainer)
vocab = tokenizer_spec.get_vocab()
print(vocab)  # the printed vocabulary contains tokens longer than one character

What I'd expect instead is a vocabulary consisting only of the individual characters in the corpus.
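
For reference, here is a minimal check of that expectation, a sketch that can be appended to the script above (it reuses the script's vocab variable and is not part of the original report):

# With max_token_length=1, every vocabulary entry should be a single character.
# Given the behaviour described above, this assertion fails.
long_tokens = [token for token in vocab if len(token) > 1]
assert not long_tokens, f"unexpected multi-character tokens: {long_tokens}"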

I can indeed reproduce this. It works correctly for other values:

In [2]: from tokenizers.trainers import BpeTrainer
   ...: from tokenizers import Tokenizer
   ...: from tokenizers.models import BPE
   ...: 
   ...: trainer = BpeTrainer(max_token_length=64)
   ...: 
   ...: tokenizer_spec = Tokenizer(BPE())
   ...: tokenizer_spec.train_from_iterator(["hello world, orl lorld, corld forld"], trainer=trainer)
   ...: vocab = tokenizer_spec.get_vocab()
   ...: print(vocab)
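
As a quick sanity check (a hedged addition, not part of the original session), the longest token in that vocabulary can be inspected to confirm that the 64-character limit is honoured:

# Sketch: verify that no token exceeds the configured limit of 64 characters.
longest = max(vocab, key=len)
print(len(longest), repr(longest))
assert len(longest) <= 64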

However, it doesn't seem possible to go lower than 2.
This is most probably an issue with

pub(super) fn merge(
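
For illustration only, here is a toy Python sketch of the guard that merge function is expected to enforce (a simplified assumption about the intended behaviour, not the actual Rust implementation): any candidate merge whose result would exceed max_token_length is skipped, so with max_token_length=1 no merge can ever be applied and the vocabulary stays character-only.

# Toy sketch of the intended length guard, not the library's merge() code.
def apply_merges(words, merges, max_token_length):
    vocab = {char for word in words for char in word}
    for left, right in merges:
        merged = left + right
        if len(merged) > max_token_length:
            continue  # skip: the merged token would exceed the limit
        vocab.add(merged)
    return vocab

# With max_token_length=1 every proposed merge is skipped,
# so only single characters remain in the vocabulary.
print(apply_merges(["hello", "world"], [("l", "l"), ("o", "r")], max_token_length=1))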