tensorflow/text

Byte-Level BPE Tokenizer (GPT-2/RoBERTa)

sinamoeini opened this issue · 5 comments

Hi TensorFlow team,

Is there going to be a byte-level BPE tokenizer in TensorFlow Text?

BPE is already supported by SentencePiece. If you have a SentencePiece model, you can use it with the SentencePiece op.
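For reference, a minimal sketch of that op via `tensorflow_text.SentencepieceTokenizer` (the model path here is a placeholder for your own trained model file):

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Load a serialized SentencePiece model proto ("spm.model" is a placeholder).
with open("spm.model", "rb") as f:
    sp_model = f.read()

tokenizer = tf_text.SentencepieceTokenizer(model=sp_model)
tokens = tokenizer.tokenize(["hello world"])  # RaggedTensor of token ids
detok = tokenizer.detokenize(tokens)          # round-trips back to strings
```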

Thank you @thuang513. So SentencePiece and byte-level BPE differ only in the training phase, right? In that case, if I have a trained byte-level BPE model, I should be able to use SentencePiece and just give it the vocab and merges.

I believe so. If you trained the model using SentencePiece, you should be able to use it as is.
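For illustration, training a BPE model with the `sentencepiece` Python package might look like the sketch below (file names and vocab size are placeholders), after which the resulting `spm.model` can be loaded by the TF Text op above:

```python
import sentencepiece as spm

# Train a BPE model; "corpus.txt" is a placeholder (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm",   # writes spm.model and spm.vocab
    vocab_size=32000,
    model_type="bpe",     # BPE instead of the default unigram model
)
```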

I dug a bit more, and it seems there is a small difference in how they treat spaces. I will try to replicate the Hugging Face RoBERTa tokenizer using the TensorFlow SentencePiece tokenizer and update this thread.
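As a sketch of the difference I mean (assuming the Hugging Face `transformers` package): GPT-2/RoBERTa byte-level BPE marks a preceding space with 'Ġ' on the following token, while SentencePiece prepends the metasymbol '▁' (U+2581) to pieces.

```python
from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("Hello world"))
# ['Hello', 'Ġworld']  -- space encoded as 'Ġ' on the next token.
# A SentencePiece model would instead produce pieces like
# ['▁Hello', '▁world'], so the vocabularies are not directly
# interchangeable without accounting for this convention.
```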

@sinamoeini Any updates on this?