Questions about the tokenizer
nickyoungforu opened this issue · 1 comment
Hi, I ran the sample code:
```python
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto")

# A prompt ending in [START_REF] asks the model to generate a citation.
input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
The resulting input_ids is tensor([[ 592, 23121, 5219, 243, 4]]), but the token with id 23121 in tokenizer.json is 'ĠTransformer', not 'Transformer'. Why is that?
Also, why is there no need to add the start token <s> at the beginning and the end token </s> at the end?
Hi, have a look at https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2 and check out the Introduction to GALACTICA Models, especially the "New document mode" section.
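In short: GALACTICA uses a byte-level BPE tokenizer, in which 'Ġ' encodes the space that precedes a word, and the tokenizer does not insert <s>/</s> for you. Here is a minimal sketch to check both points yourself (the printed ids are the ones from your example; the exact token strings beyond position 1 are my assumption about how byte-level BPE splits this prompt):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")

# Byte-level BPE stores the leading space inside the token itself:
# 'Ġ' marks that space, so 'ĠTransformer' decodes back to ' Transformer'.
ids = tokenizer("The Transformer architecture [START_REF]").input_ids
print(ids)                                   # [592, 23121, 5219, 243, 4]
print(tokenizer.convert_ids_to_tokens(ids))  # ['The', 'ĠTransformer', ...]
print(repr(tokenizer.decode([23121])))       # ' Transformer' (space restored)

# No <s>/</s> were inserted above: the encoded prompt contains only the
# tokens of the text itself. Inspect which special tokens are defined:
print(tokenizer.special_tokens_map)
```

So 'ĠTransformer' is just the internal spelling of ' Transformer' (leading space included), and nothing is lost on decoding. As for <s>/</s>: the tokenizer leaves special tokens to you, and per the linked notes the document-start token mainly matters when you want the model to generate a new document from scratch ("new document mode") rather than continue a prompt.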