microsoft/onnxruntime-extensions

ClipTokenizer ignores pad_token

horixon opened this issue

Some CLIP tokenizer use cases require a pad id other than 0, because a vocabulary can assign id 0 to a real character, word piece, or word.
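As a quick illustration (a minimal sketch, assuming the `openai/clip-vit-base-patch32` vocabulary, where id 0 appears to map to the token `"!"`):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("openai/clip-vit-base-patch32")

# id 0 is an ordinary vocabulary entry, not a reserved pad slot,
# so padding with 0 injects a real token into the sequence.
print(tokenizer.id_to_token(0))      # "!"
print(tokenizer.id_to_token(49407))  # "<|endoftext|>"
```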

Below is an example using the Hugging Face `tokenizers` library, where the EOS token (id 49407) is used for padding and is applied as expected:
```python
from tokenizers import Tokenizer

tokenizer_max_length = 77
tokenizer = Tokenizer.from_pretrained("openai/clip-vit-base-patch32")

tokenizer.enable_truncation(tokenizer_max_length)
tokenizer.enable_padding(pad_id=49407, length=tokenizer_max_length)

prompt = "a photo of a cat"  # any example prompt
ids = tokenizer.encode(prompt).ids
print(ids)  # 77 ids, padded at the end with 49407
```
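With this configuration the returned list always has length 77 and the trailing positions hold 49407. A quick check, continuing the snippet above:

```python
assert len(ids) == tokenizer_max_length
assert ids[-1] == 49407  # padding carries the configured pad_id, not 0
```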

In the onnxruntime-extensions ClipTokenizer, by contrast, it looks like 0 is always used for padding and the configured pad token is never applied; the padding loop below appears to ignore any pad id:

```cpp
// Padding loop in the ClipTokenizer kernel: the fill value appears to be
// hard-coded rather than taken from a configured pad token.
for (size_t i = res.size(); i < max_length; i++) {
```
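Until the op honors a configurable pad id, one possible workaround is to rewrite the zero padding after the fact. This is only a sketch: `repad` is a hypothetical helper, and it is inherently ambiguous, since a genuine trailing id 0 (a real vocabulary token) is indistinguishable from padding, which is exactly the problem this issue describes.

```python
from typing import List

def repad(ids: List[int], pad_id: int = 49407) -> List[int]:
    """Replace trailing 0s (the tokenizer's hard-coded padding) with pad_id.

    Caveat: a legitimate trailing token with id 0 would also be rewritten;
    the ambiguity is unavoidable until the op exposes a pad_id option.
    """
    out = list(ids)
    i = len(out)
    while i > 0 and out[i - 1] == 0:
        out[i - 1] = pad_id
        i -= 1
    return out

# 49406 / 49407 are CLIP's BOS / EOS ids; 320 is an illustrative token id.
print(repad([49406, 320, 49407, 0, 0]))
# [49406, 320, 49407, 49407, 49407]
```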