Stability-AI/StableLM

Unclear tokenizer class

carmocca opened this issue · 2 comments

(Not confident on this, take with a grain of salt; this is based on a bit of quick research.)
It looks like GPT-NeoX defines "HFTokenizer" as a tokenizer type, and HF transformers defines "GPTNeoXTokenizerFast" as a tokenizer class, i.e. each project names its tokenizer after the other project. A bit confusing, but it makes sense.

So, if you're writing your own code to handle StableLM, the correct tokenizer class depends on which library you're using:

- If you're using https://github.com/EleutherAI/gpt-neox, use HFTokenizer: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/tokenizer/tokenizer.py#L224
- If you're using https://github.com/huggingface/transformers, use GPTNeoXTokenizerFast: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
- Or just use the autoloading APIs / copy from already-working examples and save yourself the confusion (see the sketch below).
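
For the transformers route, here's a minimal sketch of both options. The checkpoint id `stabilityai/stablelm-base-alpha-7b` is just an example I'm assuming here; swap in whichever StableLM checkpoint you're actually using.

```python
# Minimal sketch: loading a StableLM tokenizer via Hugging Face transformers.
# The repo id below is an assumed example; use the checkpoint you actually want.
from transformers import AutoTokenizer, GPTNeoXTokenizerFast

# Option 1: the explicit class, per the note above.
tokenizer = GPTNeoXTokenizerFast.from_pretrained("stabilityai/stablelm-base-alpha-7b")

# Option 2: let transformers resolve the class from the checkpoint's tokenizer config.
tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-7b")

# Quick round-trip to sanity-check the tokenizer.
ids = tokenizer("Hello, StableLM!")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```

In practice AutoTokenizer is the safer default, since it picks the class from the checkpoint's tokenizer config instead of you having to know it.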

That makes sense. Thank you!