Byte Level BPE Tokenizer (GPT2/RoBERTa)
sinamoeini opened this issue · 5 comments
Hi TensorFlow team,
Is there going to be a byte-level BPE tokenizer in TensorFlow Text?
BPE is already supported by SentencePiece. If you have a SentencePiece model, you can use it with the SentencePiece op.
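For reference, a minimal sketch of loading a pre-trained SentencePiece model into TensorFlow Text's `SentencepieceTokenizer`; the path `m.model` is a placeholder for your own trained model file:

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Read the serialized SentencePiece model proto (placeholder path).
with open("m.model", "rb") as f:
    sp_model = f.read()

# Wrap it in the TensorFlow Text SentencePiece op.
tokenizer = tf_text.SentencepieceTokenizer(model=sp_model, out_type=tf.int32)

# Tokenize a batch of strings into id tensors.
tokens = tokenizer.tokenize(["Hello world"])
print(tokens)
```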
Thank you @thuang513. So SentencePiece and byte-level BPE only differ in the training phase, right? So if I have a trained byte-level BPE model, I should be able to use SentencePiece and just give it the vocab and merges.
I believe so. If you trained the model using SentencePiece, you should be able to use it as is.
I dug a bit more and it seems there is a small difference in how they treat spaces. I will try to replicate the Hugging Face RoBERTa tokenizer using the TensorFlow SentencePiece tokenizer and update this thread.
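To illustrate the space-handling difference I mean, here is a hedged comparison sketch: RoBERTa's byte-level BPE marks a word-initial space with "Ġ", while SentencePiece replaces spaces with "▁". It assumes the `transformers` and `sentencepiece` packages; "roberta-base" is the public RoBERTa checkpoint and `m.model` is a placeholder SentencePiece model, so the exact token splits depend on your model.

```python
from transformers import RobertaTokenizer
import sentencepiece as spm

# Byte-level BPE (RoBERTa): leading spaces become the "Ġ" prefix.
hf_tok = RobertaTokenizer.from_pretrained("roberta-base")
print(hf_tok.tokenize("Hello world"))          # e.g. ['Hello', 'Ġworld']

# SentencePiece: spaces are replaced with the "▁" meta symbol.
sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world']
```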
@sinamoeini Any updates on this?