huggingface/tokenizers

Special token gets tokenized while training tokenizer from scratch

LalchandPandia opened this issue · 1 comment

@ArthurZucker I am trying to train a byte-level BPE tokenizer from scratch on my dataset. I have a list of words that I want to be treated as single tokens, but when I train the tokenizer and then tokenize with it, I observe that these words get split into multiple pieces. My end goal is to train a RoBERTa LM on my dataset.
```python
from tokenizers import ByteLevelBPETokenizer

files = 'file.txt'

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files,
    vocab_size=100000,
    min_frequency=5,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "auto_part", "bokchoy"],
)

tokenizer.save_model('bpe_piece')
```
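For reference, the saved model can also be loaded back with the tokenizers library itself to check where the split happens, before going through transformers. A minimal sketch, assuming `save_model` wrote `vocab.json` and `merges.txt` into `bpe_piece/`:

```python
from tokenizers import ByteLevelBPETokenizer

# Load the files written by save_model and encode the same text directly,
# to see whether the split already happens at the tokenizers level or only
# after loading through transformers.
raw_tokenizer = ByteLevelBPETokenizer(
    'bpe_piece/vocab.json',
    'bpe_piece/merges.txt',
    lowercase=True,
)
print(raw_tokenizer.encode('an bokchoy auto_part').tokens)
```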

Test the tokenizer:
```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('bpe_piece')
print(tokenizer.tokenize('an bokchoy auto_part'))
```
The output should be `['an', 'bokchoy', 'auto_part']`,
but instead the output is `['an', 'Ġbok', 'choy', 'Ġauto', '_', 'part']`.
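A workaround I am considering, but have not verified, is registering the domain words as added tokens on the transformers side so they are matched as whole units. Sketch only; I am not sure this is the intended way to handle them:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('bpe_piece')
# Register the words as added special tokens so they are split out as whole
# units before the byte-level BPE model sees the text.
tokenizer.add_tokens(['auto_part', 'bokchoy'], special_tokens=True)
print(tokenizer.tokenize('an bokchoy auto_part'))
```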