facebookresearch/XLM

Vocab size does not match model input size

moment-of-peace opened this issue · 1 comment

Why don't the vocab and model checkpoint provided in "II. Cross-lingual language model pretraining (XLM)" of the README match? For example, the vocab size for "tokenize + lowercase + no accent + BPE" should be 95k (the embedding size of the model), but after downloading, the vocab file actually has more than 120k lines.
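
For reference, here is a minimal sketch of how I'm comparing the two sizes, assuming the checkpoint is a standard PyTorch `.pth` dict and that the embedding weight key contains `embeddings.weight` (the file names and the key name are assumptions, not confirmed against the repo):

```python
import torch

# Hypothetical file names; substitute the actual downloads.
VOCAB_PATH = "vocab_xnli_15"
MODEL_PATH = "mlm_tlm_xnli15_1024.pth"

# Count entries in the downloaded vocab file (one token per line).
with open(VOCAB_PATH, encoding="utf-8") as f:
    n_vocab = sum(1 for _ in f)

# Load the checkpoint on CPU and inspect the embedding matrix shape.
ckpt = torch.load(MODEL_PATH, map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under 'model'
emb_key = next(k for k in state if "embeddings.weight" in k)  # assumed key name
n_emb = state[emb_key].shape[0]

print("vocab file entries:", n_vocab)   # > 120k in my case
print("model embedding rows:", n_emb)   # 95k
```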

Similar issue here with the XLM-R 100-language model vocab file: it should have a 200k vocab, but the downloaded file has 239,776 entries.