AndraxDev/speak-gpt

playground token counting is misleading as it doesn't use a given model's tokenizer

Closed this issue · 1 comment

Hi,

I think the playground token counting is misleading because many models don't publish their tokenizer, so we can't know the token IDs, and sometimes not even the token count.

For example, Anthropic's API lets you count the tokens in a string but doesn't expose their IDs, and `\t\t` counts as 1 token for Claude but 2 for OpenAI's GPT-3.5 and GPT-4 models. For some Python code, that can make a huge difference!
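A quick way to see the OpenAI side of this is tiktoken, which exposes the exact tokenizers for GPT-3.5/4; a minimal sketch (the Claude-side count of 1 is taken from the report above, since Anthropic's tokenizer isn't published):

```python
import tiktoken

# OpenAI publishes its tokenizers, so exact token IDs are available.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = enc.encode("\t\t")
print(tokens)       # the actual token IDs
print(len(tokens))  # the report above measures 2 tokens here for GPT-3.5/4

# Anthropic, by contrast, only returns a count via its SDK/API,
# not the token IDs, and reportedly counts "\t\t" as a single token.
```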

I think it might be best to warn the user when the selected model does not correspond to the tokenizer being used, just before displaying the token info, instead of reporting a bare token count without a disclaimer.
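A minimal sketch of what that disclaimer could look like; the model list, helper name, and message wording here are hypothetical illustrations, not SpeakGPT's actual code:

```python
# Hypothetical: models whose tokenizer is public, so the count is exact.
EXACT_TOKENIZER_MODELS = {"gpt-3.5-turbo", "gpt-4"}

def describe_token_count(model: str, token_count: int) -> str:
    """Format a token count, flagging it as an estimate for models
    whose own tokenizer is not available."""
    if model in EXACT_TOKENIZER_MODELS:
        return f"{token_count} tokens"
    return (f"~{token_count} tokens (estimated with a different tokenizer; "
            f"{model} does not publish its own)")

print(describe_token_count("gpt-4", 42))     # exact: "42 tokens"
print(describe_token_count("claude-2", 42))  # estimate with disclaimer
```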

This function is experimental and may be removed in the future. It is not accepting new ideas or bug reports.