Tokens Removed from Trained Custom BPE Tokenizer

Question

rteehas opened this issue 2 months ago · 0 comments

Hi,

I've trained a custom BPE tokenizer on Unicode strings, starting from an initial vocab. However, I noticed that some tokens from the starting vocab don't appear in the vocab of the trained tokenizer. Is this expected? For example, are tokens removed if they never appear in the training data? Is there a way to force the tokenizer to preserve the starting vocab?
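
For concreteness, here is a minimal sketch of the kind of comparison that shows which starting tokens went missing. It assumes the starting vocab is a list of strings and the trained vocab is a token-to-id mapping (for example, whatever the library's vocab accessor returns). The helper name `missing_tokens` and the sample data are hypothetical and only illustrate the check, not the actual training setup.

```python
def missing_tokens(starting_vocab, trained_vocab):
    """Return the starting-vocab tokens absent from the trained vocab, in their original order."""
    trained = set(trained_vocab)  # iterating a dict yields its keys (the tokens)
    return [tok for tok in starting_vocab if tok not in trained]


# Hypothetical data standing in for the real vocabularies.
starting_vocab = ["a", "b", "é", "ß"]
trained_vocab = {"a": 0, "b": 1, "ab": 2, "é": 3}

dropped = missing_tokens(starting_vocab, trained_vocab)
print(dropped)
```

If the list is non-empty, those are the tokens I would like the trained tokenizer to keep.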