google/sentencepiece

decode token one by one


Hi @taku910,
Based on this case #1043, I got it, thanks. I see other LLM applications decode tokens one by one.
If I need to implement this, do you have any suggestions?

Probably we could decode the ids directly with id_to_piece, though the result is not always the same as the decode method.

>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file='test_model.model')
>>> ids = sp.encode('hello world. sentencepiece is a language independent tokenizer.')
>>> ids
[39, 88, 21, 887, 6, 331, 15, 256, 29, 25, 16, 135, 47, 11, 960, 20, 981, 109, 10, 46, 98, 25, 997, 40, 6]
>>> s = ''
>>> for id in ids:
...     s += sp.id_to_piece(id)
...     print(s.replace('▁', ' ').lstrip(' '))
... 
he
hell
hello
hello world
hello world.
hello world. sen
hello world. sent
hello world. sentence
hello world. sentencep
hello world. sentencepi
hello world. sentencepie
hello world. sentencepiece
hello world. sentencepiece is
hello world. sentencepiece is a
hello world. sentencepiece is a language
hello world. sentencepiece is a language in
hello world. sentencepiece is a language independ
hello world. sentencepiece is a language independent
hello world. sentencepiece is a language independent to
hello world. sentencepiece is a language independent tok
hello world. sentencepiece is a language independent token
hello world. sentencepiece is a language independent tokeni
hello world. sentencepiece is a language independent tokeniz
hello world. sentencepiece is a language independent tokenizer
hello world. sentencepiece is a language independent tokenizer.
>>>
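
If the intermediate strings need to match decode exactly (the id_to_piece approach above can differ for models that rely on normalization or byte fallback), one option is to re-decode the growing prefix of ids at each step. This is a minimal sketch, assuming the same test_model.model file as in the example above:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='test_model.model')
ids = sp.encode('hello world. sentencepiece is a language independent tokenizer.')

prefix = []
for token_id in ids:
    prefix.append(token_id)
    # decode() on the accumulated prefix applies the same detokenization
    # rules as decoding the full sequence at once.
    print(sp.decode(prefix))

This re-decodes the whole prefix at every step, so it costs O(n) per token, but the printed strings are guaranteed to agree with what decode would return for the full id list.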