Which kind of tokenizer do you use? It looks like WordPiece, not BPE.
Suxyuuu opened this issue · 1 comment
Suxyuuu commented
OpenAI's GPT-2 implementation uses BPE for its tokenizer, which needs two files: a .json file containing the vocabulary, and a .txt file containing the merges.
Your implementation only uses one vocab.txt file, and some vocabulary entries start with '##', which I infer from your tokenization.py.
So do you use WordPiece, not BPE?
(Not a native English speaker, sorry for my poor English...)
affjljoo3581 commented
I used WordPiece, which is specified on this link.
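For context, the '##' prefix mentioned above is the WordPiece convention for subword continuations: the tokenizer greedily matches the longest vocabulary entry, and any piece that does not begin a word is stored with a '##' prefix. A minimal, illustrative sketch of this greedy longest-match scheme (the vocabulary and function name here are made up for the example, not taken from this repo):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match WordPiece tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest possible substring first, shrinking from the right.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the '##' prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched; whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, purely illustrative:
vocab = {"token", "##ize", "##r"}
print(wordpiece_tokenize("tokenizer", vocab))  # ['token', '##ize', '##r']
```

This is why a single vocab.txt suffices for WordPiece, while BPE additionally needs the merges file to replay its learned merge order.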