[BUG] GPT-2 tokenizer is NOT invertible
jdeschena opened this issue · 19 comments
System Info
Hello,
It is my understanding that the GPT-2 tokenizer, obtained with `AutoTokenizer.from_pretrained("gpt2")`, should be invertible. That is, given a sentence `text`, we should have

`text == tokenizer.decode(tokenizer(text, add_special_tokens=False)["input_ids"])`

However, this is not the case, unlike the `tiktoken` reference implementation, which is correctly invertible.
For example, given the sentence `Is this restaurant family-friendly ? Yes No Unsure ? This is a follow-up sentence .`, encoding and then decoding removes the spaces before the punctuation marks, yielding a different sentence.
I have tried instantiating the tokenizer using `GPT2Tokenizer.from_pretrained("openai-community/gpt2")`, and using the options `add_prefix_space=True` or `is_split_into_words=True`, but the problem persists.
Hence, it looks like a bug to me, since BPE tokenizers should be invertible, as far as I understand.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run this code and you should see the bug. I am using `transformers==4.38.2`:

```python
import tiktoken
from transformers import AutoTokenizer, GPT2Tokenizer

# gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
oai_tokenizer = tiktoken.get_encoding("gpt2")

orig = "Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence ."

hf_enc = gpt2_tokenizer(orig)["input_ids"]
hf_dec = gpt2_tokenizer.decode(hf_enc)

oai_enc = oai_tokenizer.encode(orig)
oai_dec = oai_tokenizer.decode(oai_enc)

print(hf_dec)
print(oai_dec)
```
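With transformers 4.38 the two prints come out roughly like this (illustrative: the Hugging Face decode drops the spaces before the punctuation, while tiktoken round-trips exactly):

```text
Is this restaurant family-friendly? Yes No Unsure? This is an other sentence.
Is this restaurant family-friendly ? Yes No Unsure ? This is an other sentence .
```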
Expected behavior
The two decoded sentences should be equal, yet they are not.
Hey! Pretty sure this is due to the `clean_up_tokenization_spaces` argument. cc @itazap let's see if we can do a deprecation cycle for this one -> deactivate it by default, but still allow it to be set via the tokenizer's parameters (`tokenizer.clean_up_tokenization_spaces`), as it's something that has been coming up quite a lot!
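For anyone hitting this in the meantime, a minimal sketch of the workaround (using the existing option on `decode`; the example string is taken from the report above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
text = "Is this restaurant family-friendly ? Yes No Unsure ?"
ids = tok(text, add_special_tokens=False)["input_ids"]

# Default decode() applies the cleanup and drops the spaces before "?",
# so the round trip is not exact.
print(tok.decode(ids))

# Disabling the cleanup (per call, or via tok.clean_up_tokenization_spaces = False)
# restores the exact original string.
assert tok.decode(ids, clean_up_tokenization_spaces=False) == text
```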
@ArthurZucker deprecate or set to False by default (currently it is set to True by default)? If we allow it to be set, then we do not deprecate?
We should still deprecate (if None, default to True, but next release we default to False)
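Roughly, the deprecation cycle being described looks like this (a sketch only, not the actual transformers code; names are illustrative):

```python
import warnings

def resolve_clean_up_tokenization_spaces(value):
    """Sketch of the transition: keep the old default for now, warn, flip later."""
    if value is None:
        # The user did not set the flag explicitly: keep today's default (True),
        # but warn that a future release will change the default to False.
        warnings.warn(
            "`clean_up_tokenization_spaces` was not set. It currently defaults to True, "
            "but will default to False in a future release.",
            FutureWarning,
        )
        return True
    # An explicit user choice is always respected.
    return value
```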
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm getting this issue with Flux in ComfyUI, and the warning points to this bug report. What is the solution? Where do I set the `clean_up_tokenization_spaces` parameter to False?
Full terminal output below:
```text
got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
/home/garrett/AI/ComfyUI/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
./launch.sh: line 7: 47378 Killed python3 main.py
```
@Garrettlynch You must have figured it out by now. For others who come here at a later point, it's a parameter of the tokenizer's `from_pretrained` call:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False, clean_up_tokenization_spaces=True)
```

where `model_name` is the name of the model you want to use.
Yep! Closing once deprecation is fully done
@APratham, sorry I only just saw this at the weekend. I've not found it - where is this line? Is it supposed to be in tokenization_utils_base.py (/home/garrett/AI/ComfyUI/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py), because it's not there.
@Garrettlynch No problem at all. My comment should have been more general so that others can follow it too. The file you're talking about contains the implementation of `from_pretrained` and of the `clean_up_tokenization_spaces` argument; that's why the FutureWarning points to it.
You will need to find the place where your tokenizer is created via `from_pretrained` and pass the parameter there. I am using T5Tokenizer, which is used for text-to-text models (hence the code above); you are likely using one of the other tokenizers from Hugging Face Transformers.
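For example, Flux's text encoders use a CLIP tokenizer and a T5 tokenizer; a rough sketch of forwarding the flag when they are loaded (the model ids are illustrative, and where exactly ComfyUI constructs its tokenizers depends on the nodes in your workflow):

```python
from transformers import CLIPTokenizer, T5Tokenizer

# from_pretrained forwards the flag to the tokenizer, where it is stored
# as an attribute and used by decode() unless overridden per call.
clip_tok = CLIPTokenizer.from_pretrained(
    "openai/clip-vit-large-patch14", clean_up_tokenization_spaces=False
)
t5_tok = T5Tokenizer.from_pretrained(
    "google/t5-v1_1-xxl", clean_up_tokenization_spaces=False
)

print(clip_tok.clean_up_tokenization_spaces, t5_tok.clean_up_tokenization_spaces)
```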
I have searched all of the ComfyUI folder for 'clean_up_tokenization_spaces' and there are 47 files that return a match. These point to many different installed models. I don't want to interfere with ones that are not being used in my workflow, so I'm trying to identify which one it is. I'm using the workflow downloaded from here - https://openart.ai/workflows/maitruclam/comfyui-workflow-for-flux-simple/iuRdGnfzmTbOOzONIiVV, which looks like this in ComfyUI:
I have looked through the workflow js file, but it does not point to anything Python.
Also getting this when running the Marigold pipeline. It seems this causes issues across diffusers wherever tokenization is used, which is often internal to the pipelines.
Please update all impacted pipelines and expose the parameter in the pipeline kwargs.
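Until then, one possible workaround (a sketch only, using Stable Diffusion as an example since diffusers pipelines accept component overrides; adapt the ids and classes to your own pipeline) is to load the tokenizer yourself with the flag disabled and pass it in:

```python
from diffusers import StableDiffusionPipeline
from transformers import CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"

# Load the tokenizer with cleanup disabled, then override the pipeline's
# default tokenizer when the pipeline itself is loaded.
tokenizer = CLIPTokenizer.from_pretrained(
    model_id, subfolder="tokenizer", clean_up_tokenization_spaces=False
)
pipe = StableDiffusionPipeline.from_pretrained(model_id, tokenizer=tokenizer)
```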
cc @itazap and @Rocketknight1, given the sheer number of issues related to that, let's have a PR to expose it if that's not already the case (`**tokenizer_kwargs` should expose it) and let's document it properly.
@itazap let's also link the PR that sets it to `False` by default!
The option is already in the tokenizer docstring, so I guess we just change that, since the deprecation warning has already been added? Or is there anything else we need to change?
Excellent Approach
Interestingly, it looks like adding the URL of this issue to a deprecation warning in transformers has led to every GitHub issue that shares logs containing the warning showing up as a reference to this issue 😆 Was that intentional?
That would explain why this is the most-linked issue we've ever had, lol
Issue: `warnings.warn` / `clip missing: ['text_projection.weight']`
Issue resolved ->
Fix approach: in the UNET/MODEL workflow section, set weight_dtype from 'default' to one of the fp8 types in the dropdown list and then re-run.
Hopefully this can help you guys.
This resolved the issue for me - so simple.
```python
from transformers import MBartForConditionalGeneration, MBart50Tokenizer
from langdetect import detect

# Specify the model folder path
model_path = "model"

# Load the tokenizer and model
tokenizer = MBart50Tokenizer.from_pretrained(model_path)
model = MBartForConditionalGeneration.from_pretrained(model_path)

# Define the texts to be translated
texts_to_translate = [
    "Guten Morgen! Wie kann ich Ihnen heute helfen?",
    "Das Wetter ist heute schön. Lass uns spazieren gehen.",
    "Ich liebe es, in meiner Freizeit Bücher zu lesen。",
    "I love to read books in my spare time.",
    "Bonjour! Comment ça va?",
    "¿Cómo estás hoy?",
]

# Iterate over the texts and perform translation
for source_text in texts_to_translate:
    # Use langdetect to determine the language of the source text
    detected_language = detect(source_text)
    print(f"Detected language: {detected_language}")

    # Set the source language based on the detected language
    if detected_language == 'de':
        src_lang = "de_DE"
    elif detected_language == 'en':
        src_lang = "en_XX"
    elif detected_language == 'fr':
        src_lang = "fr_XX"
    elif detected_language == 'es':
        src_lang = "es_XX"
    else:
        print(f"Unsupported language: {detected_language}. Skipping...")
        continue  # If the language is unsupported, skip to the next text

    # Set the source language in the tokenizer
    tokenizer.src_lang = src_lang

    # Encode the source text
    encoded_text = tokenizer(source_text, return_tensors="pt")

    # Generate translation, ensuring the target language is Chinese
    generated_tokens = model.generate(
        **encoded_text,
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]  # Target language code (Chinese)
    )

    # Decode the translation result
    translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

    # Check if the translated text is valid
    if not translated_text.strip():
        translated_text = "Translation error or empty output."

    print(f"Source Text: {source_text}")
    print("Translated Text (zh_CN):", translated_text)
    print()  # Output a blank line for better readability
```
out:

```text
Detected language: de
Source Text: Guten Morgen! Wie kann ich Ihnen heute helfen?
Translated Text (zh_CN): Good morning! How can I help you today?

Detected language: de
Source Text: Das Wetter ist heute schön. Lass uns spazieren gehen.
Translated Text (zh_CN): The weather is fine today. Let’s go for a walk.

Detected language: de
Source Text: Ich liebe es, in meiner Freizeit Bücher zu lesen。
Translated Text (zh_CN): I love to read books in my spare time.

Detected language: en
Source Text: I love to read books in my spare time.
Translated Text (zh_CN): 我喜欢在空闲时间读书.
```