salesforce/xgen

Is there a defined pad_token in the vocabulary?


I'm trying to mask a specific part of the text in the fine-tuning labels. When I use the eos_token for masking, the model overfits to <|endoftext|>. Is there a pad_token, or any other token defined in the vocabulary, that can be used in this case?

We don't have a pad_token. You can instead try adding a new special token that is not in the vocab. See https://huggingface.co/Salesforce/xgen-7b-8k-base/blob/main/tokenization_xgen.py
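For example, something along these lines should work (just a sketch; '<|padding|>' is an arbitrary string that is not already in the vocab):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", trust_remote_code=True)

# Register a brand-new special token to act as the pad token.
num_added = tokenizer.add_special_tokens({'pad_token': '<|padding|>'})
print(num_added)               # 1 if the token was actually new
print(tokenizer.pad_token_id)  # should be a new id, not one already used by the vocab

If the new id falls outside the model's embedding table, you may also need model.resize_token_embeddings(len(tokenizer)) before fine-tuning.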

I have tried adding it like this:

tokenizer.add_special_tokens({'pad_token': '<|padding|>'})

But instead of adding a new token, it replaces the '<' token.

Is there a specific method I should follow to add a special token?


Try:

tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", pad_token='<|padding|>', trust_remote_code=True)

Same behavior with this as well: <|padding|> gets assigned token id 27, which was previously the id of '<'.


Maybe the old version of the tokenizer script was cached. Can you delete the cached file, and try again?
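For example (a sketch; the exact location depends on your HF_HOME / cache settings):

import shutil
from pathlib import Path

# Tokenizer scripts loaded with trust_remote_code=True are cached under the
# Hugging Face modules cache; adjust the path if you use a custom cache dir.
cached_modules = Path.home() / ".cache" / "huggingface" / "modules" / "transformers_modules" / "Salesforce"
if cached_modules.exists():
    shutil.rmtree(cached_modules)  # remove the cached tokenization_xgen.py so it gets re-fetched

After that, re-run from_pretrained and the updated tokenizer script should be picked up.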

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", pad_token='<|padding|>', trust_remote_code=True)

print(tokenizer.pad_token_id)
print(tokenizer.pad_token)

Here's what I got:

50313
<|padding|>
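With the pad token registered, one way to keep the model from learning it during fine-tuning is to set those label positions to -100, the ignore index of the cross-entropy loss used by Hugging Face causal-LM models (a minimal sketch, assuming a standard padding setup):

batch = tokenizer(["short example", "a somewhat longer second example"], padding=True, return_tensors="pt")

labels = batch["input_ids"].clone()
# Pad positions contribute nothing to the loss, so the model cannot overfit to the pad token.
labels[labels == tokenizer.pad_token_id] = -100
batch["labels"] = labels

The same -100 trick can be applied to any other span of the label you want excluded from the loss, rather than writing eos_token there.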

Thank you. It was the cached version of the previous tokenizer.