bnosac/sentencepiece

wordpiece_encode ids

Closed this issue · 2 comments

When type = ids, wordpiece_encode currently returns 0-indexed ids. This is explicitly coded on https://github.com/bnosac/sentencepiece/blob/master/R/wordpiece.R#L40 with a - 1L.

I feel like this number should probably be 1-indexed, since we're passing in an R character vector. A lot of usecases would then have to subtract the 1L, so I'm not strongly advocating, but it's at least worth noting in help.

Let me know what you'd prefer and I can PR.

I'm not a fan of this as this behaviour is similar in sentencepiece::sentencepiece_encode and sentencepiece::sentencepiece_decode and tokenizers.bpe::bpe_encode and tokenizers.bpe::bpe_decode and in all models which are available.

Ok, no problem!