gkamradt/LLMTest_NeedleInAHaystack

Standard Tokenizer

Closed this issue · 12 comments

Before proceeding with the implementation, I would like to reach a consensus.

To ensure a fair and consistent evaluation of different language models, I propose standardizing our tokenizer.

Currently, we use different tokenizers, including cl100k_base (tiktoken) for OpenAI models and an unspecified tokenizer from Anthropic. This lack of standardization introduces bias, as tokenizers vary in how they split text into units, thus affecting context length calculations.

I recommend adopting cl100k_base as our standard tokenizer due to its open-source availability. This will create a level playing field for model comparisons. The difference between tokenizers is less significant for shorter contexts but becomes more pronounced as context length increases.

Using the same tokenizer would not affect the integrity of the test: in this project, the tokenizer is only used to measure the context length and to find the depth at which to insert the needle.

Results from my testing: Anthropic uses more tokens to represent text of the same length. Code is in this colab.

[Image: token counts for the same text under the OpenAI and Anthropic tokenizers]
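As a rough illustration of that comparison (not the notebook's exact code; a second tiktoken encoding stands in for the Anthropic tokenizer here, since the exact Anthropic call depends on the SDK version):

```python
# Minimal sketch: count how many tokens two tokenizers need for the same text.
# cl100k_base is the GPT-3.5/4 encoding; p50k_base stands in for "another
# tokenizer" purely to show the shape of the comparison.
import tiktoken

text = "The quick brown fox jumps over the lazy dog. " * 1000

for name in ("cl100k_base", "p50k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens for {len(text)} characters")
```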

Hi @prabha-git, I would like to disagree with using a standard tokenizer.

Tokenizers are coupled with models. Visit https://tiktokenizer.vercel.app/ and see that the tokens generated for GPT2 vs GPT4 are different. They have different vocabularies, which differ in merges as well as vocabulary size.

I understand that it is not a level playing field, but it is not something we can solve. The whole training of a model is done using tokens from its specific tokenizer.

Model inference will fail or behave in a random manner if we provide tokens from a tokenizer that is not compatible with it.

I would like to close the issue, as this is not feasible.

@kedarchandrayan we can't choose a tokenizer for a pre-trained transformer model, so I am not proposing that :)

Let me clarify this in detail; I hope my explanation effectively conveys my points.

In our project, we utilize the tokenizer corresponding to the model under test, as demonstrated in the following example for Anthropic Models:

self.tokenizer = AnthropicModel().get_tokenizer()

Why is this tokenizer necessary for our project? The insert_needle method evaluates the context length and the depth at which the needle is inserted on the encoded string, i.e. on tokens, so effectively the tokenizer is used to measure length. We decode the text after the needle is inserted and send it for inference.
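A hypothetical sketch of that flow (illustrative, not the project's actual implementation): the context is trimmed in token space, the insertion point is computed in tokens, and the result is decoded back to text before inference.

```python
# Hypothetical sketch of token-based needle insertion, not the project's code.
# The tokenizer is only used as a ruler for length and depth.
import tiktoken

def insert_needle_sketch(context: str, needle: str, depth_percent: float,
                         context_length: int) -> str:
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: cl100k_base as the ruler

    context_tokens = enc.encode(context)[:context_length]  # trim to the target token length
    needle_tokens = enc.encode(needle)

    # Depth is a percentage of the token-measured context length.
    insertion_point = int(len(context_tokens) * depth_percent / 100)
    new_tokens = (context_tokens[:insertion_point]
                  + needle_tokens
                  + context_tokens[insertion_point:])

    # Decode back to plain text before sending the prompt for inference.
    return enc.decode(new_tokens)
```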

However, this approach introduces a bias, especially noticeable in longer contexts. For instance, if we set the context size to 1M tokens, Anthropic's model might encompass approximately 3M characters, whereas OpenAI's model might include around 3.2M characters from your document.

Therefore, I propose using the same tokenizer for measuring the context length and determining the depth at which to insert the needle.

Additionally, for some models, like Gemini Pro, accessing their specific tokenizer isn't possible, so we must rely on publicly available tokenizers anyway.

@prabha-git The bias you are pointing to only appears when you think from a human point of view (character length). For the model, the input is always tokenized. We are testing the ability of a model to find a needle in a haystack of tokens. That's why keeping everything in terms of native tokens is more intuitive from the model's perspective.

When you say that 1M tokens mean 3M characters for Anthropic's model and 3.2M characters for OpenAI's model, you are comparing them from a human perspective. For the models, these are 1M tokens either way; they were trained to work with tokens like these.

Following are some subtle points which I want to make:

  • The tokenizer is also used for limiting the context to a given context length in encode_and_trim. If we use a standard tokenizer, this can result in a context that has a different number of tokens than the maximum allowed for that model.
  • Placing a needle at a given depth percent according to a standard tokenizer might insert the needle at a different position than the model's native tokenizer would (see the sketch after this list).
  • From a model's perspective, it makes more sense to insert the needle at the correct depth percent according to its own tokenizer. This point becomes even more relevant in multi-needle testing, where there are multiple depths to consider.
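To make the second point concrete, here is a small illustrative sketch (two tiktoken encodings stand in for a "standard" and a "native" tokenizer; the file name is hypothetical) showing that the same depth percent can land at different character offsets:

```python
# Illustration only: the same depth percent maps to different character
# positions under different tokenizers. cl100k_base plays the "standard"
# tokenizer; p50k_base stands in for a model's native tokenizer.
import tiktoken

text = open("haystack.txt").read()  # hypothetical long document
depth_percent = 50

for name in ("cl100k_base", "p50k_base"):
    enc = tiktoken.get_encoding(name)
    tokens = enc.encode(text)
    cut = int(len(tokens) * depth_percent / 100)
    char_offset = len(enc.decode(tokens[:cut]))  # where the needle would land, in characters
    print(f"{name}: depth {depth_percent}% -> character offset {char_offset}")
```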

I agree with the point on Gemini Pro. We will have to think of something else. That is a problem to solve.

Please let me know your opinion.

Thank you, @prabha-git, for initiating this discussion. It's crucial we address this issue before we proceed with integrating models like Gemini, for which we lack a tokenizer.

If I understand you correctly:

  • @prabha-git believes that the discrepancy when using a standard tokenizer across all models is acceptable.
  • @kedarchandrayan believes that the discrepancy when using a standard tokenizer across all models is not acceptable.

The following chart I've created illustrates the percentage difference in the number of tokens between cl100k_base (GPT-3.5/4) and various other tokenizers when representing the same text translated into multiple languages.

[Chart: model_lang_distributions_difference_to_gpt4 — percentage difference in token counts relative to cl100k_base, per language]

From the chart above, we can interpret that:

  • Gemini Pro requires approximately 15% fewer tokens to represent the same text, compared to GPT-3.5/4.
  • Mistral and other models require approximately 15% more tokens to represent the same text, compared to GPT-3.5/4.

Let's explore whether a 15% discrepancy is tolerable, or if there exists an alternative solution that could reduce this error to 0%.

@pavelkraleu - that's cool that you compared across different models.

I think I am making a rather simple point. Considering this is a testing framework for various LLMs, do we want to measure all models with the same yardstick, or with the yardstick provided by each model provider? If we are just testing one model and the intention is NOT to compare the results with other models, then it is fine to use the tokenizer provided by the model.

If we are using the tokenizer provided by each model provider, comparisons like this would be misleading at larger context windows. The image below is from this post.

[Image: comparison chart from the linked post]

I think we don't need to worry about what tokenizer a specific model is using. We measure the context length and depth using a standard tokenizer and pass the context to the LLM for inference.
If it finds the needle correctly, it is green for the given context length and depth. And of course, we highlight in the documentation that we are using a standardized tokenizer. Thoughts?

@prabha-git, this subject is one of my major interests, and I will be speaking about it at PyCon SK next week, so I have many charts like this lying around. 🙂

Don't you think that if we use cl100k_base for Gemini, for example, something like this may happen? Because the Gemini tokenizer is more efficient at encoding information, will we never test Gemini's full depth?

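To put rough numbers on that (treating the ~15% figure from the chart above as an assumption): a context that measures 1M cl100k_base tokens would only be about 850K tokens for a tokenizer that is 15% more efficient, so the top of that model's native window would never be exercised.

```python
# Back-of-the-envelope check, assuming a tokenizer that needs ~15% fewer
# tokens than cl100k_base for the same text (figure taken from the chart above).
standard_tokens = 1_000_000            # context length measured with cl100k_base
efficiency_gap = 0.15                  # hypothetical efficiency advantage
native_tokens = int(standard_tokens * (1 - efficiency_gap))
print(f"native tokens actually filled: ~{native_tokens:,}")  # ~850,000
```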

I am new to this discussion, but if I read @pavelkraleu correctly, an inherent problem would exist with using a general tokenizer: we could overshoot or under-place the needle within the context window by using a tokenizer different from the one the model uses.

This seems to create a couple of problems though. Say a great new model, BibbityBop, is released; if BibbityBop doesn't use a tokenizer that is already implemented in NIAH, then the above scenario would happen. Wouldn't this, by extension, mean that the needle placement in the context window could never be perfectly accurate unless the exact tokenizer the model uses were used?

Exactly, @douglasdlewis2012, that's what I think will be happening 🙂

However, we are not completely lost.
All the LLM APIs I know return something like prompt_tokens and completion_tokens, so I think we can work around it and count tokens differently.
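A minimal sketch of that workaround, assuming the OpenAI Python SDK; other providers expose similar usage fields under different names (Anthropic's Messages API, for example, reports input/output token counts in its usage object).

```python
# Sketch: rely on provider-reported token counts instead of a local tokenizer.
# Assumes the OpenAI Python SDK (v1+); field names differ between providers.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello, world"}],
)
print("prompt tokens:", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
```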

I'm with the majority here - I would also recommend we do not use a standard tokenizer across models.

In practice we like to say "oh Claude has 200K tokens and GPT has 128K tokens" as if they are the same thing, but each model uses a different tokenizer, so it really should be, "oh Claude has 200K tokens using tokenizer X and GPT has 128K tokens using tokenizer Y"

Because we are doing length evaluation, we'll need to get as close as possible to the tokenizer used by the model (ideally the same one).

Because we won't have all the tokenizers, we'll need to have a backup or default and adjust as we get more information.
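That backup could be as simple as the following hypothetical helper (not the project's actual code), which falls back to cl100k_base whenever tiktoken doesn't know the model:

```python
# Hypothetical fallback: use the model's own encoding when tiktoken knows it,
# otherwise default to cl100k_base as the backup ruler.
import tiktoken

def get_length_tokenizer(model_name: str) -> tiktoken.Encoding:
    try:
        return tiktoken.encoding_for_model(model_name)   # exact match when available
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")      # default/backup encoding
```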

(In reply to @pavelkraleu's comment above about speaking at PyCon SK.)

Wow, cool! Do they upload the presentations to YouTube? Will look it up :)

(In reply to the comment above recommending against a standard tokenizer and suggesting a backup/default tokenizer.)

Sounds good. I have seen people who don't realize that these models don't use the same tokenizer, but yeah, with a standardized tokenizer we may not get the same length on the x-axis as the model's max context window.

Thanks for the discussion, will close this issue.


Thanks @prabha-git, it was great discussing this topic. I liked the articulation in terms of graphs and visuals. Please keep suggesting issues and initiating discussions 👍