Confused about vocab and encoder
weiguowilliam opened this issue · 0 comments
weiguowilliam commented
I'm reading the source code and have two questions about the vocab and the encoder. Thank you in advance for any help.
- In vocab.bpe, take the second row (Ġ t) as an example. "Ġ" also appears in many other rows (for example, the third row). Why isn't the correspondence one-to-one?
- Are the items in encoder.json the subtokens produced by BPE? Take "\u0120regress" as an example: why does "\u0120" appear here?
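
To make both questions concrete, here is a minimal sketch of where "Ġ" (U+0120) may come from. It assumes the files follow a byte-level BPE scheme in which every byte is first mapped to a printable Unicode character, as the `bytes_to_unicode` helper in GPT-2 style encoders does. The function below is my own reconstruction of that idea, and the names `byte_encoder`, `space_symbol`, and `visible` are only illustrative. Under that assumption, "Ġ" is just the printable stand-in for a leading space byte, which would explain why it shows up in many merge rows and in many encoder.json keys.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character.

    Assumption: this mirrors the byte-to-character table used by
    byte-level BPE encoders; it is a reconstruction, not the repo's code.
    """
    # Bytes that are already printable keep their own code point:
    # '!'..'~', U+00A1..U+00AC, U+00AE..U+00FF.
    bs = (
        list(range(0x21, 0x7E + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, ...) is shifted
    # to an unused code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()

# The space byte (0x20) is not in the printable ranges, so it is remapped.
space_symbol = byte_encoder[ord(" ")]
print(repr(space_symbol), hex(ord(space_symbol)))

# A word preceded by a space, rendered through the same table.
visible = "".join(byte_encoder[b] for b in " regress".encode("utf-8"))
print(visible)
```

If this assumption holds, a row such as `Ġ t` in vocab.bpe would be a merge rule (join the space symbol with `t`), not a one-to-one mapping, and a key such as "\u0120regress" in encoder.json would be the token for " regress" with its leading space.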