wordpiece_encode ids

Question

wordpiece_encode ids

Closed this issue 4 years ago · 2 comments

When type = ids, wordpiece_encode currently returns 0-indexed ids. This is explicitly coded on https://github.com/bnosac/sentencepiece/blob/master/R/wordpiece.R#L40 with a - 1L.

I feel like this number should probably be 1-indexed, since we're passing in an R character vector. A lot of usecases would then have to subtract the 1L, so I'm not strongly advocating, but it's at least worth noting in help.

Let me know what you'd prefer and I can PR.

Answer 1 · 2020-11-05T08:54:28.000Z

I'm not a fan of this as this behaviour is similar in sentencepiece::sentencepiece_encode and sentencepiece::sentencepiece_decode and tokenizers.bpe::bpe_encode and tokenizers.bpe::bpe_decode and in all models which are available.

Answer 2 · 2020-11-05T14:53:03.000Z

Ok, no problem!