[BUG] GPT-2 tokenizer is NOT invertible
jdeschena opened this issue · 19 comments
System Info
Hello,
It is my understanding that the GPT-2 tokenizer, obtained with `AutoTokenizer.from_pretrained("gpt2")`, should be invertible. That is, given a sentence `text`, we should have

`text == tokenizer.decode(tokenizer(text, add_special_tokens=False)["input_ids"])`

However, this is not the case, unlike the `tiktoken` reference implementation, which is correctly invertible.
For example, given the sentence `Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence .`, encoding and then decoding removes the spaces before the punctuation marks, yielding a different sentence.
I have tried instantiating the tokenizer using `GPT2Tokenizer.from_pretrained("openai-community/gpt2")`, and using the options `add_prefix_space=True` or `is_split_into_words=True`, but the problem persists.
Hence, it looks like a bug to me, since BPE tokenizers should be invertible, as far as I understand.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run this code and you should see the bug. I am using `transformers==4.38.2`:

```python
import tiktoken
from transformers import AutoTokenizer, GPT2Tokenizer

# gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
oai_tokenizer = tiktoken.get_encoding("gpt2")

orig = "Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence ."

hf_enc = gpt2_tokenizer(orig)["input_ids"]
hf_dec = gpt2_tokenizer.decode(hf_enc)

oai_enc = oai_tokenizer.encode(orig)
oai_dec = oai_tokenizer.decode(oai_enc)

print(hf_dec)
print(oai_dec)
```
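With transformers 4.38 the two prints come out roughly like this (illustrative: the Hugging Face decode drops the spaces before the punctuation, while tiktoken round-trips exactly):

```text
Is this restaurant family-friendly? Yes No Unsure? This is an other sentence.
Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence .
```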
Expected behavior
The two decoded sentences should be equal, yet they are not.
Hey! Pretty sure this is due to the `clean_up_tokenization_spaces` argument. cc @itazap let's see if we can do a deprecation cycle for this one -> deactivate it by default, but still allow it to be set via the tokenizer's parameters (`tokenizer.clean_up_tokenization_spaces`), as it's something that has been coming up quite a lot!
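For anyone hitting this in the meantime, a minimal sketch of the workaround (using the existing option on `decode`; the example string is taken from the report above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
text = "Is this restaurant family-friendly ? Yes No Unsure ?"
ids = tok(text, add_special_tokens=False)["input_ids"]

# Default decode() applies the cleanup and drops the spaces before "?",
# so the round trip is not exact.
print(tok.decode(ids))

# Disabling the cleanup (per call, or via tok.clean_up_tokenization_spaces = False)
# restores the exact original string.
assert tok.decode(ids, clean_up_tokenization_spaces=False) == text
```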
@ArthurZucker deprecate or set to False by default (currently it is set to True by default)? If we allow it to be set, then we do not deprecate?
We should still deprecate (if None, default to True, but next release we default to False)
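Roughly, the deprecation cycle being described looks like this (a sketch only, not the actual transformers code; names are illustrative):

```python
import warnings

def resolve_clean_up_tokenization_spaces(value):
    """Sketch of the transition: keep the old default for now, warn, flip later."""
    if value is None:
        # The user did not set the flag explicitly: keep today's default (True),
        # but warn that a future release will change the default to False.
        warnings.warn(
            "`clean_up_tokenization_spaces` was not set. It currently defaults to True, "
            "but will default to False in a future release.",
            FutureWarning,
        )
        return True
    # An explicit user choice is always respected.
    return value
```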
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm getting this issue with Flux in ComfyUI, and the warning points to this bug report. What is the solution? Where do I set the `clean_up_tokenization_spaces` parameter to False?
Full terminal output below:
```text
got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
/home/garrett/AI/ComfyUI/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
./launch.sh: line 7: 47378 Killed python3 main.py
```
@Garrettlynch You must have figured it out by now. For others who come here at a later point, it's a parameter of the tokenizer's `from_pretrained` call:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False, clean_up_tokenization_spaces=True)
```

where `model_name` is the name of the model you want to use.
Yep! Closing once deprecation is fully done
@APratham, sorry I only just saw this at the weekend. I've not found it - where is this line? Is it supposed to be in tokenization_utils_base.py (/home/garrett/AI/ComfyUI/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py), because it's not there.
@Garrettlynch No problem at all. My comment should have been more general so that others can follow it too. The file you're talking about contains the implementation of `from_pretrained` and of the `clean_up_tokenization_spaces` argument; that's why the FutureWarning points to it.
You will need to find the place where your tokenizer is created via `from_pretrained` and pass the parameter there. I am using T5Tokenizer, which is used for text-to-text models (hence the code above); you are likely using one of the other tokenizers from Hugging Face Transformers.
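For example, Flux's text encoders use a CLIP tokenizer and a T5 tokenizer; a rough sketch of forwarding the flag when they are loaded (the model ids are illustrative, and where exactly ComfyUI constructs its tokenizers depends on the nodes in your workflow):

```python
from transformers import CLIPTokenizer, T5Tokenizer

# from_pretrained forwards the flag to the tokenizer, where it is stored
# as an attribute and used by decode() unless overridden per call.
clip_tok = CLIPTokenizer.from_pretrained(
    "openai/clip-vit-large-patch14", clean_up_tokenization_spaces=False
)
t5_tok = T5Tokenizer.from_pretrained(
    "google/t5-v1_1-xxl", clean_up_tokenization_spaces=False
)

print(clip_tok.clean_up_tokenization_spaces, t5_tok.clean_up_tokenization_spaces)
```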
I have searched all of the ComfyUI folder for 'clean_up_tokenization_spaces' and there are 47 files that return a match. These point to many different installed models. I don't want to interfere with ones that are not being used in my workflow, so I'm trying to identify which one it is. I'm using the workflow downloaded from here - https://openart.ai/workflows/maitruclam/comfyui-workflow-for-flux-simple/iuRdGnfzmTbOOzONIiVV, which looks like this in ComfyUI:
I have looked through the workflow js file, but it does not point to anything Python.
Also getting this when running the Marigold pipeline. It seems this causes issues across diffusers wherever tokenization is used, which is often internal to the pipelines.
Please update all impacted pipelines and expose the parameter in the pipeline kwargs.
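Until then, one possible workaround (a sketch only, using Stable Diffusion as an example since diffusers pipelines accept component overrides; adapt the ids and classes to your own pipeline) is to load the tokenizer yourself with the flag disabled and pass it in:

```python
from diffusers import StableDiffusionPipeline
from transformers import CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"

# Load the tokenizer with cleanup disabled, then override the pipeline's
# default tokenizer when the pipeline itself is loaded.
tokenizer = CLIPTokenizer.from_pretrained(
    model_id, subfolder="tokenizer", clean_up_tokenization_spaces=False
)
pipe = StableDiffusionPipeline.from_pretrained(model_id, tokenizer=tokenizer)
```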
cc @itazap and @Rocketknight1, given the sheer number of issues related to that, let's have a PR to expose it if that's not already the case (`**tokenizer_kwargs` should expose it) and let's document it properly.
@itazap let's also link the PR that sets it to `False` by default!
The option is already in the tokenizer docstring, so I guess we just change that, since the deprecation warning has already been added? Or is there anything else we need to change?
Excellent Approach
Interestingly, it looks like adding the URL of this issue to a deprecation warning in transformers has led to every GitHub issue that shares logs containing the warning showing up as a reference to this issue 😆 Was that intentional?
That would explain why this is the most-linked issue we've ever had, lol
Issue: `warnings.warn` / `clip missing: ['text_projection.weight']`
Issue resolved ->
Fix approach: in the UNET/MODEL workflow section, set weight_dtype from 'default' to one of the fp8 types in the dropdown list and then re-run.
Hopefully this can help you guys.
This resolved the issue for me - so simple.
```python
from transformers import MBartForConditionalGeneration, MBart50Tokenizer
from langdetect import detect

# Specify the model folder path
model_path = "model"

# Load the tokenizer and model
tokenizer = MBart50Tokenizer.from_pretrained(model_path)
model = MBartForConditionalGeneration.from_pretrained(model_path)

# Define the texts to be translated
texts_to_translate = [
    "Guten Morgen! Wie kann ich Ihnen heute helfen?",
    "Das Wetter ist heute schön. Lass uns spazieren gehen.",
    "Ich liebe es, in meiner Freizeit Bücher zu lesen。",
    "I love to read books in my spare time.",
    "Bonjour! Comment ça va?",
    "¿Cómo estás hoy?",
]

# Iterate over the texts and perform translation
for source_text in texts_to_translate:
    # Use langdetect to determine the language of the source text
    detected_language = detect(source_text)
    print(f"Detected language: {detected_language}")

    # Set the source language based on the detected language
    if detected_language == 'de':
        src_lang = "de_DE"
    elif detected_language == 'en':
        src_lang = "en_XX"
    elif detected_language == 'fr':
        src_lang = "fr_XX"
    elif detected_language == 'es':
        src_lang = "es_XX"
    else:
        print(f"Unsupported language: {detected_language}. Skipping...")
        continue  # If the language is unsupported, skip to the next text

    # Set the source language in the tokenizer
    tokenizer.src_lang = src_lang

    # Encode the source text
    encoded_text = tokenizer(source_text, return_tensors="pt")

    # Generate translation, ensuring the target language is Chinese
    generated_tokens = model.generate(
        **encoded_text,
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]  # Target language code (Chinese)
    )

    # Decode the translation result
    translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

    # Check if the translated text is valid
    if not translated_text.strip():
        translated_text = "Translation error or empty output."

    print(f"Source Text: {source_text}")
    print("Translated Text (zh_CN):", translated_text)
    print()  # Output a blank line for better readability
```
out:

```text
Detected language: de
Source Text: Guten Morgen! Wie kann ich Ihnen heute helfen?
Translated Text (zh_CN): Good morning! How can I help you today?

Detected language: de
Source Text: Das Wetter ist heute schön. Lass uns spazieren gehen.
Translated Text (zh_CN): The weather is fine today. Let’s go for a walk.

Detected language: de
Source Text: Ich liebe es, in meiner Freizeit Bücher zu lesen。
Translated Text (zh_CN): I love to read books in my spare time.

Detected language: en
Source Text: I love to read books in my spare time.
Translated Text (zh_CN): 我喜欢在空闲时间读书.
```