huggingface/tokenizers

New update causes add_special_tokens not to be recognized

sravell opened this issue · 3 comments

Regardless of whether `add_special_tokens` is passed explicitly, encoding triggers: `Keyword arguments {'add_special_tokens': False} not recognized.`

And when it is actually passed to control whether special tokens are added, it has no effect at all.

from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
model = LlamaForCausalLM.from_pretrained(model_name)
tokenizer = LlamaTokenizer.from_pretrained(model_name)

input_text = "The science of today is the technology of tomorrow."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

output = model.generate(input_ids, max_length=50)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Output: Keyword arguments {'add_special_tokens': False} not recognized.
The science of today is the technology of tomorrow.
The future of science is the future of technology.
Science is the study of the natural world, and technology is the study of how to use that knowledge to improve our lives.

A temporary workaround that suppresses the warning.

I found that it is caused by `transformers/tokenization_utils.py:562`.

I just added `_ = kwargs.pop("add_special_tokens", self.add_special_tokens)` at line 557.

This is what it looks like:

    def tokenize(self, text: TextInput, **kwargs) -> List[str]:
        """
        Converts a string into a sequence of tokens, using the tokenizer.

        Split in words for word-based vocabulary or sub-words for sub-word-based vocabularies
        (BPE/SentencePieces/WordPieces). Takes care of added tokens.

        Args:
            text (`str`):
                The sequence to be encoded.
            **kwargs (additional keyword arguments):
                Passed along to the model-specific `prepare_for_tokenization` preprocessing method.

        Returns:
            `List[str]`: The list of tokens.
        """
        split_special_tokens = kwargs.pop("split_special_tokens", self.split_special_tokens)
        _ = kwargs.pop("add_special_tokens", self.add_special_tokens)  # added line: discard the kwarg so it is not reported as unrecognized

        # remaining code ...
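For context, the warning itself comes from the tail of `tokenize`, which logs whatever kwargs survive `prepare_for_tokenization` (paraphrased from the transformers source near the cited lines; the exact code may differ between versions):

    text, kwargs = self.prepare_for_tokenization(text, **kwargs)
    # any kwarg that was not popped above lands here and triggers the message
    if kwargs:
        logger.warning(f"Keyword arguments {kwargs} not recognized.")

So popping `add_special_tokens` before this point is enough to silence the message.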

Result:

$ python
Python 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import LlamaTokenizer
>>> import torch
>>> MODEL_PRECISION = torch.float32
>>> LLAMA_MODEL = "meta-llama/Llama-2-7b-chat-hf"
>>> tokenizer = LlamaTokenizer.from_pretrained(LLAMA_MODEL, device_map=0, torch_dtype=MODEL_PRECISION)
>>> tokenizer.encode("hi")
[1, 7251]
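
With the patch applied, passing the kwarg explicitly should also go through without the warning. The expected output below is inferred from the encoding above (the BOS id 1 is simply dropped), not from a fresh run:

>>> tokenizer.encode("hi", add_special_tokens=False)
[7251]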

Same issue. Good idea, yjsoh. I added the same line.