affjljoo3581/GPT2

Which kind of tokenizer do you use? It looks like WordPiece, not BPE.

Closed this issue · 1 comment

OpenAI's GPT-2 implementation builds its tokenizer with BPE, which needs two files: a .json file containing the vocabulary and a .txt file containing the merges.
Your implementation uses only a single vocab.txt file, and some vocabulary entries start with '##', which is implied by your tokenization.py.
So do you use WordPiece rather than BPE?
(I'm not a native English speaker; sorry for my poor English...)
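
For context, a minimal sketch of the difference being asked about (this assumes the Hugging Face `transformers` package, which is not part of this repo, and the file paths are illustrative placeholders):

```python
from transformers import GPT2Tokenizer, BertTokenizer

# OpenAI-style byte-level BPE: needs a vocabulary file *and* a merges file.
# (Placeholder paths; substitute real tokenizer files.)
bpe_tok = GPT2Tokenizer(vocab_file="vocab.json", merges_file="merges.txt")

# WordPiece (as used by BERT): a single vocab.txt where continuation
# pieces are marked with a leading '##'.
wp_tok = BertTokenizer(vocab_file="vocab.txt")

print(bpe_tok.tokenize("tokenization"))  # e.g. ['token', 'ization'] via BPE merges
print(wp_tok.tokenize("tokenization"))   # e.g. ['token', '##ization'] with the '##' prefix
```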

I used WordPiece, which is specified in this link.