Decode tokens one by one
nigelzzz commented
taku910 commented
You can probably decode each id directly with id_to_piece, though the result is not always the same as what the decode method returns.
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file='test_model.model')
>>> ids = sp.encode('hello world. sentencepiece is a language independent tokenizer.')
>>> ids
[39, 88, 21, 887, 6, 331, 15, 256, 29, 25, 16, 135, 47, 11, 960, 20, 981, 109, 10, 46, 98, 25, 997, 40, 6]
>>> s = ''
>>> for id in ids:
...     s += sp.id_to_piece(id)
...     print(s.replace('▁', ' ').lstrip(' '))
...
he
hell
hello
hello world
hello world.
hello world. sen
hello world. sent
hello world. sentence
hello world. sentencep
hello world. sentencepi
hello world. sentencepie
hello world. sentencepiece
hello world. sentencepiece is
hello world. sentencepiece is a
hello world. sentencepiece is a language
hello world. sentencepiece is a language in
hello world. sentencepiece is a language independ
hello world. sentencepiece is a language independent
hello world. sentencepiece is a language independent to
hello world. sentencepiece is a language independent tok
hello world. sentencepiece is a language independent token
hello world. sentencepiece is a language independent tokeni
hello world. sentencepiece is a language independent tokeniz
hello world. sentencepiece is a language independent tokenizer
hello world. sentencepiece is a language independent tokenizer.
>>>
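The loop above can be wrapped in a small generator. What follows is only a sketch, not part of the sentencepiece API: `stream_by_pieces` and `stream_by_prefix_decode` are hypothetical helper names. The first repeats the id_to_piece approach on a list of piece strings. The second re-decodes the growing prefix of ids at every step, so each line is exactly what decode returns for that prefix, at the cost of quadratic work.

```python
from typing import Callable, Iterable, Iterator, Sequence

# The '▁' character that marks a word boundary in SentencePiece pieces.
SPACE_MARKER = "\u2581"


def stream_by_pieces(pieces: Iterable[str]) -> Iterator[str]:
    """Yield the text so far after each piece is appended (hypothetical helper)."""
    buf = ""
    for piece in pieces:
        buf += piece
        # Same post-processing as the REPL loop: marker -> space, drop leading space.
        yield buf.replace(SPACE_MARKER, " ").lstrip(" ")


def stream_by_prefix_decode(
    ids: Sequence[int],
    decode: Callable[[Sequence[int]], str],
) -> Iterator[str]:
    """Yield decode(ids[:n]) for n = 1..len(ids) (hypothetical helper).

    `decode` is any callable that maps a list of ids to text.
    """
    for n in range(1, len(ids) + 1):
        yield decode(ids[:n])


if __name__ == "__main__":
    # Hand-written pieces for illustration; no model file is needed here.
    demo_pieces = ["\u2581he", "ll", "o", "\u2581world", "."]
    for text in stream_by_pieces(demo_pieces):
        print(text)
```

With a loaded processor, this would presumably be called as `stream_by_pieces(sp.id_to_piece(i) for i in ids)` or `stream_by_prefix_decode(ids, sp.decode)`. The second form is the safer choice when the output has to agree with decode.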