salesforce/CodeGen2

Question about unk, bos, and eos tokens

Hi,

When I loaded the codegen2-7b vocabulary, I found that the unk, bos, and eos tokens are identical. This confuses me, since I would expect these three special tokens to be different. Here are my code and the results. Is this normal?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/path/codegen2-7B")

print(f"pad token id: {tokenizer.pad_token_id}")
print(f"unk token id: {tokenizer.unk_token_id}")
print(f"bos token id: {tokenizer.bos_token_id}")
print(f"eos token id: {tokenizer.eos_token_id}")
print(f"pad token: {tokenizer.pad_token}")
print(f"unk token: {tokenizer.unk_token}")
print(f"bos token: {tokenizer.bos_token}")
print(f"eos token: {tokenizer.eos_token}")

Output:

Using pad_token, but it is not set yet.
pad token id: None
unk token id: 50256
bos token id: 50256
eos token id: 50256
pad token: None
unk token: <|endoftext|>
bos token: <|endoftext|>
eos token: <|endoftext|>
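
To make the overlap explicit rather than reading it off the printed lines, the special tokens can be grouped by id. This is only a minimal sketch: `group_special_tokens` and `SPECIAL_NAMES` are hypothetical helper names, and `stub` is a stand-in object that copies the ids printed above so the snippet runs without loading the model.

```python
from collections import defaultdict
from types import SimpleNamespace

# Roles to inspect; each is read from the attribute "<role>_token_id".
SPECIAL_NAMES = ("pad", "unk", "bos", "eos")


def group_special_tokens(tokenizer):
    """Return a dict mapping each special-token id to the roles that share it."""
    groups = defaultdict(list)
    for name in SPECIAL_NAMES:
        # Missing attributes are treated the same as an unset token (None).
        token_id = getattr(tokenizer, f"{name}_token_id", None)
        groups[token_id].append(name)
    return dict(groups)


# Stand-in (assumption): mirrors the ids from the output above.
stub = SimpleNamespace(
    pad_token_id=None,
    unk_token_id=50256,
    bos_token_id=50256,
    eos_token_id=50256,
)

groups = group_special_tokens(stub)
print(groups)  # {None: ['pad'], 50256: ['unk', 'bos', 'eos']}
```

The same function should work on the tokenizer object loaded above, since it only reads the `*_token_id` attributes already used in the print statements.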