Questions about the tokenizer
nickyoungforu opened this issue · 1 comment
Hi, I ran the sample code:
```python
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto")

# A prompt ending in [START_REF] asks the model to generate a citation.
input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
The resulting input_ids is tensor([[ 592, 23121, 5219, 243, 4]]), but the token with id 23121 in tokenizer.json is 'ĠTransformer', not 'Transformer'. Why is that?
Also, why is there no need to add the start token <s> at the beginning and the end token </s> at the end?
Hi, have a look at https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2 and check out the Introduction to GALACTICA Models, especially the "New document mode" section.
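In short: GALACTICA uses a byte-level BPE tokenizer, in which 'Ġ' encodes the space that precedes a word, and the tokenizer does not insert <s>/</s> for you. Here is a minimal sketch to check both points yourself (the printed ids are the ones from your example; the exact token strings beyond position 1 are my assumption about how byte-level BPE splits this prompt):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")

# Byte-level BPE stores the leading space inside the token itself:
# 'Ġ' marks that space, so 'ĠTransformer' decodes back to ' Transformer'.
ids = tokenizer("The Transformer architecture [START_REF]").input_ids
print(ids)                                   # [592, 23121, 5219, 243, 4]
print(tokenizer.convert_ids_to_tokens(ids))  # ['The', 'ĠTransformer', ...]
print(repr(tokenizer.decode([23121])))       # ' Transformer' (space restored)

# No <s>/</s> were inserted above: the encoded prompt contains only the
# tokens of the text itself. Inspect which special tokens are defined:
print(tokenizer.special_tokens_map)
```

So 'ĠTransformer' is just the internal spelling of ' Transformer' (leading space included), and nothing is lost on decoding. As for <s>/</s>: the tokenizer leaves special tokens to you, and per the linked notes the document-start token mainly matters when you want the model to generate a new document from scratch ("new document mode") rather than continue a prompt.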