Is it correct that the wwm model's tokenizer splits everything into individual characters?
rmbone opened this issue · 2 comments
from transformers import AutoTokenizer
tokenizer_auto = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")
tokens2 = tokenizer_auto("使用语言模型来预测下一个词的probability。")
print(tokens2)
print(tokenizer_auto.decode(tokens2["input_ids"]))
{'input_ids': [101, 886, 4500, 6427, 6241, 3563, 1798, 3341, 7564, 3844, 678, 671, 702, 6404, 4638, 8376, 8668, 13254, 511, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] 使 用 语 言 模 型 来 预 测 下 一 个 词 的 probability 。 [SEP]
Why don't I see subword tokens like 模 ##型 anywhere?
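This is expected: BERT-style Chinese tokenizers split every CJK character into its own token before WordPiece runs, so Chinese words are never merged into `##`-prefixed subwords (only non-CJK strings like "probability" get WordPiece splits). Whole-word masking changes which tokens are masked together during pretraining, not how text is tokenized. A minimal sketch of the CJK-splitting step, modeled on the behavior of BERT's BasicTokenizer (the character ranges below are an approximation of the ones it checks):

```python
def is_cjk(cp: int) -> bool:
    # A few of the CJK ideograph ranges that BERT-style tokenizers treat
    # as "split into single characters" (approximate, not exhaustive).
    return (0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF) \
        or (0xF900 <= cp <= 0xFAFF) or (0x20000 <= cp <= 0x2A6DF)

def tokenize_chinese_chars(text: str) -> str:
    # Surround every CJK character with spaces so whitespace splitting
    # later turns each one into its own token.
    out = []
    for ch in text:
        if is_cjk(ord(ch)):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)

print(tokenize_chinese_chars("模型的probability").split())
# ['模', '型', '的', 'probability']
```

After this step, WordPiece only ever sees single Chinese characters, so it has nothing to split into `模 ##型` — that is why the decoded output above shows one character per token.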
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.