simonw/ttok

Ability to count tokens for models other than OpenAI

simonw opened this issue · 3 comments

simonw commented

Had a great tip on Discord about the tokenizers library - the quicktour says: https://huggingface.co/docs/tokenizers/python/latest/quicktour.html#using-a-pretrained-tokenizer

You can load any tokenizer from the Hugging Face Hub as long as a tokenizer.json file is available in the repository.

And sure enough, this seems to work:

>>> import tokenizers
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("TheBloke/Llama-2-70B-fp16")
Downloaded 1.76MiB in 0s
>>> tokenizer.encode("hello world")
Encoding(num_tokens=3, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
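For ttok's purposes the count is just the length of the ids on the returned Encoding. A minimal sketch of wrapping that - it uses a tiny hand-built WordLevel tokenizer with a made-up vocab so it runs without downloading anything; a real model's vocab would come from Tokenizer.from_pretrained(...) as above:

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Toy vocabulary, purely illustrative - a real repo's tokenizer.json
# defines this for you when loaded via Tokenizer.from_pretrained(...).
vocab = {"hello": 0, "world": 1, "[UNK]": 2}
tokenizer = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

def count_tokens(text: str) -> int:
    # Encoding.ids holds one integer per token, so its length is the count.
    return len(tokenizer.encode(text).ids)

print(count_tokens("hello world"))  # 2 tokens with this toy vocab
```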

What if you don't know the origin of the model? All you have to go by is its name.

Is there baked-in metadata we can read that tells us which tokenizer to use?
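I'm not aware of reliable baked-in metadata beyond the files in the repo itself, so one pragmatic option is to just attempt the load and treat failure as "no tokenizer available". A rough sketch (the function name is my own, not part of any API):

```python
from tokenizers import Tokenizer

def try_load_tokenizer(model_name: str):
    """Return a Tokenizer if the Hub repo ships a tokenizer.json, else None."""
    try:
        # Works for any repo that has a tokenizer.json at its root.
        return Tokenizer.from_pretrained(model_name)
    except Exception:
        # Missing repo, missing tokenizer.json, or a network error.
        return None
```

A caller could then fall back to tiktoken (ttok's current default) when this returns None.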

So what exactly can we use for Claude models, e.g. Claude 3.5 Sonnet?