wordpiece_encode ids
Closed this issue · 2 comments
jonthegeek commented
When type = ids
, wordpiece_encode
currently returns 0-indexed ids. This is explicitly coded on https://github.com/bnosac/sentencepiece/blob/master/R/wordpiece.R#L40 with a - 1L
.
I feel like this number should probably be 1-indexed, since we're passing in an R character vector. A lot of usecases would then have to subtract the 1L, so I'm not strongly advocating, but it's at least worth noting in help.
Let me know what you'd prefer and I can PR.
jwijffels commented
I'm not a fan of this as this behaviour is similar in sentencepiece::sentencepiece_encode and sentencepiece::sentencepiece_decode and tokenizers.bpe::bpe_encode and tokenizers.bpe::bpe_decode and in all models which are available.
jonthegeek commented
Ok, no problem!