ClipTokenizer ignores pad_token
horixon opened this issue · 0 comments
horixon commented
Some CLIP tokenizer use cases may require a pad id other than 0, since a vocab can use id 0 to represent an actual character/word piece/word.
Below is an example using the HuggingFace `tokenizers` library, where the EOS token is used for padding and sequences are padded as expected with id 49407:
```python
from tokenizers import Tokenizer

tokenizer_max_length = 77
tokenizer = Tokenizer.from_pretrained("openai/clip-vit-base-patch32")
tokenizer.enable_truncation(tokenizer_max_length)
# pad with the EOS token id instead of the default 0
tokenizer.enable_padding(pad_id=49407, length=tokenizer_max_length)

prompt = "a photo of a cat"  # example prompt
tokenizer.encode(prompt).ids
```
In ClipTokenizer, it looks like id 0 is always used for padding and the configured pad token is never applied.
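For comparison, here is a minimal sketch of what configurable padding could look like (the function name and signature are hypothetical, not ClipTokenizer's actual code):

```python
def pad_ids(ids, length=77, pad_id=49407):
    """Truncate to `length`, then right-pad with `pad_id` instead of a hardcoded 0.
    pad_id defaults to CLIP's EOS id (49407) purely for illustration."""
    ids = ids[:length]
    return ids + [pad_id] * (length - len(ids))

print(pad_ids([320, 1125, 539], length=8))
# → [320, 1125, 539, 49407, 49407, 49407, 49407, 49407]
```

Exposing `pad_id` as a parameter (rather than assuming 0) would let callers match HuggingFace's behavior above.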