tensorflow/text

Generate BERT wordpiece vocab?

xwli-chelsea opened this issue · 2 comments

Hi there!

I noticed there are scripts that seem to be related to learning wordpiece vocab under https://github.com/tensorflow/text/tree/master/tools/wordpiece_vocab.

We want to pre-train BERT models from scratch, which requires learning the wordpiece vocabulary. I'm wondering:

  • Can we use the tools provided here to learn the wordpiece vocab?
  • Is it compatible with the wordpiece tokenizer in tf.text (so that we can use the WordpieceTokenizer for writing preprocessing ops in fine-tuning graphs)?

Thanks a lot!

Yes, you can use [1] to generate a wordpiece vocab file, which can then be used with BertTokenizer or WordpieceTokenizer.

[1] https://github.com/tensorflow/text/blob/master/tools/wordpiece_vocab/generate_vocab.py
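For intuition about why a vocab generated this way is compatible, here is a minimal pure-Python sketch of the greedy longest-match-first lookup that wordpiece tokenization performs against a vocab. The `suffix` and `unk_token` defaults mirror BERT's conventions; this is an illustration of the algorithm, not the tf.text implementation itself.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", suffix="##"):
    """Split a single word into wordpieces via greedy longest-match-first.

    Illustrative sketch only: tf.text's WordpieceTokenizer implements this
    matching as a TensorFlow op over a vocab lookup table.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = suffix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no decomposition exists in this vocab
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Any vocab file whose entries follow this `##` continuation convention (as generate_vocab.py produces) will drop straight into WordpieceTokenizer.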

By the way, do you have any benchmark numbers or rough estimates of graph-mode preprocessing time for the SentencePiece BPE vs. wordpiece implementations? Our goal is also to optimize for inference latency, so we'd appreciate any suggestions on which one to use. Thanks! Cc @thuang513
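No official numbers are cited in this thread, but a rough comparison is easy to run yourself. A sketch of a generic micro-benchmark harness, assuming `tokenize_fn` wraps whichever batch tokenizer you want to measure (e.g. a traced tf.function calling WordpieceTokenizer or SentencepieceTokenizer; the dummy whitespace tokenizer below is a stand-in):

```python
import timeit

def benchmark(tokenize_fn, texts, repeats=5):
    """Best-of-N wall-clock seconds per example for a batch-tokenize callable.

    Illustrative harness only; for TF graph-mode ops, call the function once
    beforehand so tracing/compilation is excluded from the timing.
    """
    best = min(timeit.repeat(lambda: tokenize_fn(texts), number=1, repeat=repeats))
    return best / len(texts)

# Stand-in tokenizer; replace with your wrapped WordPiece / SentencePiece fn.
dummy_fn = lambda batch: [t.split() for t in batch]
per_example = benchmark(dummy_fn, ["the quick brown fox"] * 1000)
print(f"{per_example * 1e6:.2f} us/example")
```

Timing both implementations on a sample of your real traffic is more reliable than any generic number, since latency depends heavily on vocab size and sequence length.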