Unclear tokenizer class
carmocca opened this issue · 2 comments
The repo's YAML config points to `HFTokenizer`:
https://github.com/Stability-AI/StableLM/blob/e60081/configs/stablelm-base-alpha-3b.yaml#L108
But the HF upload points to `GPTNeoXTokenizer`:
https://huggingface.co/stabilityai/stablelm-base-alpha-3b/blob/main/tokenizer_config.json#L7
Which one is correct?
(Not confident on this, take with a grain of salt; this is based on a bit of quick research.)
It looks like GPT-NeoX defines `HFTokenizer` as a tokenizer type, and Hugging Face defines `GPTNeoXTokenizer` as a tokenizer type. That is, each project names a tokenizer type after the other project. A bit confusing, but it makes sense.
So if you're writing your own code to handle StableLM, the correct class depends on which library you're using:
If you're using https://github.com/EleutherAI/gpt-neox — use `HFTokenizer`:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/tokenizer/tokenizer.py#L224
If you're using https://github.com/huggingface/transformers — use `GPTNeoXTokenizerFast`:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
(or just use autoloading libraries / copy from already-working examples, and save yourself the confusion)
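For the `transformers` path, a minimal sketch of the autoloading approach (assumes `transformers` is installed and the tokenizer files can be fetched from the Hub; the model ID is the one from this issue):

```python
# Sketch: load the StableLM tokenizer via autoloading, so you never have to
# name the concrete class yourself. AutoTokenizer reads tokenizer_config.json
# ("GPTNeoXTokenizer") and resolves it to the fast implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")
print(type(tokenizer).__name__)  # GPTNeoXTokenizerFast

# Round-trip a string to sanity-check the tokenizer.
ids = tokenizer("Hello, StableLM!")["input_ids"]
print(tokenizer.decode(ids))
```

This is why the autoloading route sidesteps the naming confusion: the class resolution happens from the uploaded config, not from anything you type.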
That makes sense. Thank you!