Unclear tokenizer class
carmocca opened this issue · 2 comments
The repo's YAML config points to `HFTokenizer`:
https://github.com/Stability-AI/StableLM/blob/e60081/configs/stablelm-base-alpha-3b.yaml#L108
But the HF upload points to `GPTNeoXTokenizer`:
https://huggingface.co/stabilityai/stablelm-base-alpha-3b/blob/main/tokenizer_config.json#L7
Which one is correct?
(Not confident on this, take with a grain of salt; this is based on a bit of quick research.)
It looks like GPT-NeoX defines `HFTokenizer` as a tokenizer type, and Hugging Face defines `GPTNeoXTokenizer` as a tokenizer type. That is, each project names a tokenizer type after the other project. A bit confusing, but it makes sense.
So if you're writing your own code to handle StableLM, the correct class depends on which library you're using:
If you're using https://github.com/EleutherAI/gpt-neox — use `HFTokenizer`:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/tokenizer/tokenizer.py#L224
If you're using https://github.com/huggingface/transformers — use `GPTNeoXTokenizerFast`:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
(or just use autoloading libraries / copy from already-working examples, and save yourself the confusion)
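For the `transformers` path, a minimal sketch of the autoloading approach (assumes `transformers` is installed and the tokenizer files can be fetched from the Hub; the model ID is the one from this issue):

```python
# Sketch: load the StableLM tokenizer via autoloading, so you never have to
# name the concrete class yourself. AutoTokenizer reads tokenizer_config.json
# ("GPTNeoXTokenizer") and resolves it to the fast implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")
print(type(tokenizer).__name__)  # GPTNeoXTokenizerFast

# Round-trip a string to sanity-check the tokenizer.
ids = tokenizer("Hello, StableLM!")["input_ids"]
print(tokenizer.decode(ids))
```

This is why the autoloading route sidesteps the naming confusion: the class resolution happens from the uploaded config, not from anything you type.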
That makes sense. Thank you!