google/sentencepiece

logprobs in the vocabulary file do not match the values computed from the tokenized training document


I trained a unigram model on botchan.txt following the documentation examples. I then applied the model back to the training text and estimated new logprobs by counting the tokens.

These logprobs do not match exactly, and the tokens are not ranked in the same order. I cannot explain why.

I used this command to create the model:
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=1000 --eos_id=-1 --bos_id=-1')
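
For reference, I then loaded the model like this (a minimal sketch assuming a recent version of the Python API; 'm.model' follows from --model_prefix=m above):

import sentencepiece as spm

# Load the trained model produced by the command above.
sp = spm.SentencePieceProcessor(model_file='m.model')

# The scores stored in m.vocab are also accessible through the processor.
for piece_id in range(5):
    print(sp.id_to_piece(piece_id), sp.get_score(piece_id))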

This produced a model vocabulary with these values:

<unk>	0
,	-3.40684
.	-3.54053
▁the	-3.54218
▁	-3.61926
s	-3.65378
▁I	-3.88789
▁to	-4.0266
t	-4.09847
...

I tokenized botchan.txt with: sp.encode(corpus_raw, out_type=str)
I then computed the piece logprobs from the tokenized text (the sketch after the list below shows the computation):

',': -3.421150474912426,
'.': -3.5544435623622155,
'▁the': -3.572524455004845,
's': -3.7199046282648927,
'▁I': -3.9171822762124973,
'▁': -3.9276478082521584,
'▁to': -4.038808303148048,
'ed': -4.096767853187577,
...
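
The computation was a plain count-and-normalize over the one-best output (a sketch of what I did, with sp loaded as above; corpus_raw holds the raw text of botchan.txt):

import math
from collections import Counter

# Viterbi (one-best) tokenization of the whole corpus.
pieces = sp.encode(corpus_raw, out_type=str)

# Relative frequency of each piece, converted to a natural log.
counts = Counter(pieces)
total = sum(counts.values())
logprobs = {piece: math.log(n / total) for piece, n in counts.items()}

# Highest-probability pieces first.
for piece, lp in sorted(logprobs.items(), key=lambda kv: -kv[1])[:8]:
    print(piece, lp)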

The values are close but not always identical, as with '▁', and the ranking differs.

Does anyone have an explanation?

If you count the tokens only from the output of the encode method, the probabilities will be different.

In ULM training, the EM algorithm is used to compute the marginal probabilities over all possible tokenizations.
sp.encode() performs only Viterbi (one-best) decoding, which does not consider all possible tokenizations.
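
You can observe this directly by asking for the n-best segmentations instead of the single best path (a sketch, assuming the snake_case alias of the NBestEncodeAsPieces API; it returns candidate tokenizations without their scores):

# Each sentence has many possible segmentations; training spreads
# probability mass over all of them via EM, while encode() keeps
# only the single Viterbi path.
for candidate in sp.nbest_encode_as_pieces('I went to the station.', 5):
    print(candidate)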
