microsoft/Tokenizer

Token count accuracy questions with GPT3.5

jsypkens opened this issue · 2 comments

I am pretty consistently seeing discrepancies between the results of this library and the actual Azure OpenAI service. For example, if I send a message that this library shows as around 4000 tokens to gpt-35-turbo, I'll get an error back from Azure saying it's around 4400 tokens, and thus over the limit for gpt-35-turbo.

I am also noticing discrepancies between the results of this library and the Azure OpenAI Studio chat completions playground. The chat completions playground anecdotally seems to land roughly in the same ballpark as what the Azure OpenAI API reports, so it is likely the more accurate of the two. Using that as a baseline, I ran some comparisons:

1. Azure chat completions playground (with a blank system prompt), showing 172 tokens

[screenshot]

2. With the same text, this Microsoft Tokenizer library returning 140 tokens.

[screenshot]

Separately, because I know there could be some background ChatML that's not factored into this comparison, I also did a token-by-token breakdown (see the screenshot below). In the OpenAI playground, I went word by word and noted how many tokens were added to the “input tokens” progress indicator as I typed, then compared that to the “encodedTokens” from the Tokenizer library (code shown for reference). There were two differences in just the first sentence:

"Tiktoken is a fast open-source tokenizer by OpenAI."

The following words were counted differently by Azure OpenAI Studio and the Tiktoken library:

“open-source” --> Tiktoken breaks this into 2 tokens, but the Azure playground shows it as 3
“tokenizer” --> Tiktoken shows this as 1 token, but the Azure playground shows it as 2

So, in summary, the sentence above comes through as 12 tokens in the Tiktoken library, but 14 tokens in the Azure playground. Anecdotally, I think the Azure playground is correct. It's a small sentence, but 12 → 14 is about a 15% difference, which is very close to the general gap we measure when comparing the tokens we track against the token counts shown in the Azure metrics (roughly 11% on average).

Screenshot showing the comparison, along with the code used with the Tokenizer library to run it:

[screenshot]
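A comparison along these lines can be reproduced with a snippet like the one below. This is a minimal sketch, assuming the `TokenizerBuilder.CreateByModelNameAsync` / `ITokenizer.Encode` API shown in the repo README (NuGet package `Microsoft.DeepDev.TokenizerLib`); the exact code in the screenshot may differ:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.DeepDev; // namespace per the repo README; package: Microsoft.DeepDev.TokenizerLib

// Build a cl100k_base tokenizer for gpt-3.5-turbo.
var tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo");

var text = "Tiktoken is a fast open-source tokenizer by OpenAI.";

// Plain text should contain no special tokens, so pass an empty allowed set.
var encodedTokens = tokenizer.Encode(text, new HashSet<string>());
Console.WriteLine($"Token count: {encodedTokens.Count}");

// Decode each token id on its own to see exactly how the sentence is split.
foreach (var id in encodedTokens)
{
    Console.WriteLine($"{id}\t\"{tokenizer.Decode(new[] { id })}\"");
}
```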

More testing with the actual numbers returned from the Azure OpenAI API suggests the token counter in the Azure OpenAI Studio is incorrect as well.

It seems the Tokenizer library is actually closer, if we factor in 3 tokens for the system prompt ChatML and 4 tokens of ChatML for each message.
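As a rough illustration of that adjustment (this isn't something the library itself provides): the constants 4 and 3 below are the assumptions being tested, mirroring the per-message overhead and priming numbers discussed in the OpenAI cookbook linked further down.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.DeepDev; // namespace per the repo README; package: Microsoft.DeepDev.TokenizerLib

var tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo");

// Example request: an empty system prompt plus one user message.
var messages = new[]
{
    (Role: "system", Content: ""),
    (Role: "user", Content: "Tiktoken is a fast open-source tokenizer by OpenAI."),
};

Console.WriteLine($"Estimated prompt tokens: {EstimateChatTokens(tokenizer, messages)}");

// Rough estimate of prompt tokens for a chat completion: content tokens
// counted by the tokenizer, plus an assumed ChatML overhead of 4 tokens per
// message and a fixed 3 tokens of priming, per the numbers discussed above.
static int EstimateChatTokens(ITokenizer tokenizer,
                              IEnumerable<(string Role, string Content)> messages)
{
    var total = 3; // assumed fixed overhead (system prompt / reply priming)
    foreach (var (role, content) in messages)
    {
        total += 4; // assumed per-message ChatML overhead
        total += tokenizer.Encode(role, new HashSet<string>()).Count;
        total += tokenizer.Encode(content, new HashSet<string>()).Count;
    }
    return total;
}
```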

That said, I think there may still be inaccuracies, because the newer models don't appear to be supported in this library yet?

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
See the bottom of that notebook, which shows a difference in per-message token counting between gpt-3.5-turbo-0301 and gpt-3.5-turbo-0613. It doesn't appear that this library supports the 0613 models.

@jsypkens, the tokenizer algorithm and model haven't changed from gpt-3.5-turbo-0301 to gpt-3.5-turbo-0613. The change is in the ChatCompletion API, where special tokens are added to connect the messages together.
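For anyone who wants to count those connecting tokens themselves, the README shows how the ChatML special tokens can be registered so they encode as single ids. A minimal sketch, assuming the README's special-token ids and a simplified single-message ChatML framing (the real ChatCompletion API assembles this server-side):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.DeepDev; // namespace per the repo README; package: Microsoft.DeepDev.TokenizerLib

// Register the ChatML special tokens so they encode to single ids
// (ids taken from the repo README; treat them as an assumption).
var specialTokens = new Dictionary<string, int>
{
    { "<|im_start|>", 100264 },
    { "<|im_end|>",   100265 },
};

var tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo", specialTokens);

// One ChatML-framed message, roughly as the ChatCompletion API assembles it.
var framed = "<|im_start|>user\nHello world<|im_end|>\n";

var tokens = tokenizer.Encode(framed, new HashSet<string>(specialTokens.Keys));
Console.WriteLine($"Framed message tokens: {tokens.Count}");

// The BPE itself is the same for -0301 and -0613; only this per-message
// framing (and therefore the fixed overhead added per message) changed.
```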