Weird inconsistency in Tokenizer vocabulary
javirandor opened this issue · 1 comment
Hello everyone!
I found a weird inconsistency in the tokenizer vocabulary. I wanted to ask why this could be happening.
I have loaded a tokenizer from HF:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
If I run
tokenizer.encode("\u200b")
the output is [12882]. However, taking a look at the vocabulary used for training (here), I cannot find the token \u200b; the token id instead corresponds to a different string:
"\u00e2\u0122\u012d": 12882,
This seems to generally happen with unicode characters.
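For reference, here is a minimal sketch of how this can be checked end to end (convert_ids_to_tokens and decode are standard Hugging Face tokenizer methods; the expected values in the comments follow from the encode output and the vocabulary entry above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

print(tokenizer.encode("\u200b"))              # [12882]

# The raw vocabulary entry for that id is not "\u200b" itself:
print(tokenizer.convert_ids_to_tokens(12882))  # "\u00e2\u0122\u012d"

# ...but decoding the id should round-trip back to the original character:
print(tokenizer.decode([12882]) == "\u200b")   # True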
Why could this be happening? I just want to make sure that the tokenizer I use for training is equivalent to the HF tokenizers, since my training (as anticipated in your README) results in a weird tokenizer.
Thanks a lot :)
I don't know exactly what's going on here yet, but I can confirm that the file at utils/20B_tokenizer.json is precisely the one used for vocab_file during Pythia training.
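If you want to cross-check the raw file against the loaded tokenizer, something like the following should work (a sketch, assuming the JSON has the usual tokenizers-library layout with the vocabulary under model.vocab):

import json
from transformers import PreTrainedTokenizerFast

with open("utils/20B_tokenizer.json") as f:
    raw = json.load(f)

# token string -> id, as stored in the training vocab file
vocab = raw["model"]["vocab"]
id_to_token = {i: t for t, i in vocab.items()}

tok = PreTrainedTokenizerFast(tokenizer_file="utils/20B_tokenizer.json")

print(id_to_token[12882])                # "\u00e2\u0122\u012d"
print(tok.convert_ids_to_tokens(12882))  # should match the entry above
print(tok("\u200b")["input_ids"])        # [12882]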
Also, the following snippet shows the result of loading the two tokenizers and encoding \u200b:
>>> tok1 = transformers.PreTrainedTokenizerFast(tokenizer_file="utils/20B_tokenizer.json")
>>> tok2 = transformers.AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
>>> tok1("\u200b")
{'input_ids': [12882], 'token_type_ids': [0], 'attention_mask': [1]}
>>> tok2("\u200b")
{'input_ids': [12882], 'attention_mask': [1]}
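As for why the vocab string looks nothing like \u200b: one plausible explanation (an assumption here, not something verified against the training code) is that the vocab file stores tokens in the GPT-2-style byte-level representation, where every UTF-8 byte is remapped to a printable unicode character before merges are learned. Under that assumption, the three UTF-8 bytes of \u200b (0xE2 0x80 0x8B) map exactly to \u00e2\u0122\u012d. A self-contained sketch of that mapping:

# GPT-2-style byte-to-unicode table (an assumption about how the vocab
# file represents tokens): every possible byte is mapped to a printable
# unicode character; bytes without a printable latin-1 form are remapped
# to code points starting at U+0100.
def bytes_to_unicode():
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\u00a1"), ord("\u00ac") + 1))
        + list(range(ord("\u00ae"), ord("\u00ff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

byte_to_char = bytes_to_unicode()

# "\u200b" is three UTF-8 bytes: 0xE2 0x80 0x8B.
mapped = "".join(byte_to_char[b] for b in "\u200b".encode("utf-8"))
print(mapped == "\u00e2\u0122\u012d")  # True: same string as the vocab entry

If that holds, the two tokenizers are consistent and the apparent mismatch is only in how the vocab file spells its entries.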