Which kind of tokenizer do you use? It looks like WordPiece, not BPE.
Suxyuuu opened this issue · 1 comment
Suxyuuu commented
OpenAI's GPT-2 implementation uses BPE for its tokenizer, which needs two files: a .json file containing the vocabulary, and a .txt file containing the merges.
Your implementation only uses one vocab.txt file, and some vocabulary entries start with '##', which I infer from your tokenization.py.
So do you use WordPiece, not BPE?
(Not a native English speaker, sorry for my poor English...)
affjljoo3581 commented
I used WordPiece, which is specified on this link.
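For context, the '##' prefix mentioned above is the WordPiece convention for subword continuations: the tokenizer greedily matches the longest vocabulary entry, and any piece that does not begin a word is stored with a '##' prefix. A minimal, illustrative sketch of this greedy longest-match scheme (the vocabulary and function name here are made up for the example, not taken from this repo):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match WordPiece tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest possible substring first, shrinking from the right.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the '##' prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched; whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, purely illustrative:
vocab = {"token", "##ize", "##r"}
print(wordpiece_tokenize("tokenizer", vocab))  # ['token', '##ize', '##r']
```

This is why a single vocab.txt suffices for WordPiece, while BPE additionally needs the merges file to replay its learned merge order.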