salesforce/xgen

Is there a defined pad_token in the vocabulary?


I'm trying to mask a specific part of the text in the fine-tuning labels. When I use the eos_token for masking, the model overfits to <|endoftext|>. Is there a pad_token, or any other token defined in the vocabulary, that can be used in this case?

We don't have a pad_token. You can instead try adding a new special token that is not in the vocab. See https://huggingface.co/Salesforce/xgen-7b-8k-base/blob/main/tokenization_xgen.py
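For example, something along these lines should work (just a sketch; '<|padding|>' is an arbitrary string that is not already in the vocab):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", trust_remote_code=True)

# Register a brand-new special token to act as the pad token.
num_added = tokenizer.add_special_tokens({'pad_token': '<|padding|>'})
print(num_added)               # 1 if the token was actually new
print(tokenizer.pad_token_id)  # should be a new id, not one already used by the vocab

If the new id falls outside the model's embedding table, you may also need model.resize_token_embeddings(len(tokenizer)) before fine-tuning.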

I have tried adding it like this:

tokenizer.add_special_tokens({'pad_token': '<|padding|>'})

But instead of adding a new token, it replaces the '<' token.

Is there a specific method I should follow to add a special token?


Try:

tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", pad_token='<|padding|>', trust_remote_code=True)

Same behavior with this as well: <|padding|> gets assigned token id 27, which was previously the id of '<'.


Maybe the old version of the tokenizer script was cached. Can you delete the cached file, and try again?
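For example (a sketch; the exact location depends on your HF_HOME / cache settings):

import shutil
from pathlib import Path

# Tokenizer scripts loaded with trust_remote_code=True are cached under the
# Hugging Face modules cache; adjust the path if you use a custom cache dir.
cached_modules = Path.home() / ".cache" / "huggingface" / "modules" / "transformers_modules" / "Salesforce"
if cached_modules.exists():
    shutil.rmtree(cached_modules)  # remove the cached tokenization_xgen.py so it gets re-fetched

After that, re-run from_pretrained and the updated tokenizer script should be picked up.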

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", pad_token='<|padding|>', trust_remote_code=True)

print(tokenizer.pad_token_id)
print(tokenizer.pad_token)

Here's what I got:

50313
<|padding|>
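With the pad token registered, one way to keep the model from learning it during fine-tuning is to set those label positions to -100, the ignore index of the cross-entropy loss used by Hugging Face causal-LM models (a minimal sketch, assuming a standard padding setup):

batch = tokenizer(["short example", "a somewhat longer second example"], padding=True, return_tensors="pt")

labels = batch["input_ids"].clone()
# Pad positions contribute nothing to the loss, so the model cannot overfit to the pad token.
labels[labels == tokenizer.pad_token_id] = -100
batch["labels"] = labels

The same -100 trick can be applied to any other span of the label you want excluded from the loss, rather than writing eos_token there.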

Thank you. It was the cached version of the previous tokenizer.