huggingface/transformers

Serialization error when tokenizer_config key matches function name in PreTrainedTokenizerBase

avnermay opened this issue · 2 comments

In `PreTrainedTokenizerBase.save_pretrained`, the following loop fills in the `tokenizer_config` that gets written to disk:

```python
for k in target_keys:
    if hasattr(self, k):
        tokenizer_config[k] = getattr(self, k)
```

When one of the keys in `self.init_kwargs` matches the name of a method on `PreTrainedTokenizerBase` (e.g., `add_special_tokens`), this loop replaces the value for that key in `tokenizer_config` with the bound method object, which is not JSON serializable and therefore causes an error during `save_pretrained`.
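
For reference, here is a minimal sketch of how the failure can be triggered (the checkpoint name and output directory are just examples; any extra kwarg whose name collides with a method, such as `add_special_tokens`, should hit the same path):

```python
from transformers import AutoTokenizer

# An extra kwarg whose name collides with a method on PreTrainedTokenizerBase
# ends up in self.init_kwargs.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", add_special_tokens=False)

# During save_pretrained, the loop above evaluates getattr(self, "add_special_tokens"),
# which returns the bound method instead of the stored value, so JSON serialization
# of tokenizer_config fails with "Object of type method is not JSON serializable".
tokenizer.save_pretrained("./saved-tokenizer")
```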

To solve this issue, one option is to add a check in the `__init__` function that raises an error if one of the keys matches an existing attribute/function on `PreTrainedTokenizerBase`, e.g. right before this line:

```python
self.init_kwargs = copy.deepcopy(kwargs)
```
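
A possible sketch of such a check (hypothetical; the exact condition and error type are up for discussion, and it is restricted to callables since those are what break serialization):

```python
import copy

# Hypothetical guard in PreTrainedTokenizerBase.__init__, placed just before
# the kwargs are stored: reject any kwarg whose name shadows a method, since
# save_pretrained later re-reads it via getattr(self, key).
for key in kwargs:
    if callable(getattr(self, key, None)):
        raise AttributeError(
            f"Tokenizer init kwarg `{key}` collides with an existing method of "
            f"{type(self).__name__}; this would make the saved tokenizer_config "
            "unserializable."
        )

self.init_kwargs = copy.deepcopy(kwargs)
```

The per-key `getattr`/`callable` check is O(len(kwargs)) and should be cheap relative to the rest of `__init__`, which seems to address the performance concern below.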

This error was also reported in the Stack Overflow question below:
https://stackoverflow.com/questions/78062739/huggingface-transformers-error-when-saving-model-typeerror-object-of-type-meth

Yep, this is known. I remember saying that I'd rather have a failure than duplicate attributes / functions.
Do you want to open a PR to add some kind of check?
I am fine with doing this in the init as long as it does not slow it down too much.