keras-team/keras-nlp

Add `oov_token` Argument to `BytePairTokenizer`

Opened this issue · 1 comment

The `<unk>` token is not actually used by `BytePairTokenizer`; instead, OOV tokens are mapped to -1, which causes an index error in the embedding layer.
This only occurs when the vocabulary is limited (i.e., it does not contain all the byte-level tokens), for example when running an example with a small custom vocabulary rather than a preset, but adding this feature would still be an improvement. A minimal repro sketch is below.
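A minimal sketch of the failure, assuming a hand-built vocabulary and merge list (the vocabulary, merges, and printed ids here are illustrative, not taken from any preset):

```python
import keras_nlp

# Deliberately tiny vocabulary that omits most byte-level tokens,
# so some inputs are out-of-vocabulary.
vocab = {"a": 0, "b": 1, "ab": 2}
merges = ["a b"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary=vocab,
    merges=merges,
)

# "c" has no vocabulary entry, so its id comes back as -1.
token_ids = tokenizer("abc")
print(token_ids)  # expected something like: [2, -1]

# Passing -1 into an Embedding layer is what triggers the index error:
#   embedding = keras.layers.Embedding(input_dim=len(vocab), output_dim=8)
#   embedding(token_ids)  # index -1 is out of range
```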

How would we handle this for something like GPT2, which has no unk token in the vocabulary and no index reserved for one? Seems fine to add as long as it is an optional setting for small test vocabularies.
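For reference, one hypothetical shape the fix could take: an optional `oov_token` that, when set, remaps the -1 ids to that token's id, and changes nothing when omitted, so presets like GPT2 keep their current behavior. The class and attribute names below are illustrative, not the library's API:

```python
import tensorflow as tf
import keras_nlp

class OOVRemappingBytePairTokenizer(keras_nlp.tokenizers.BytePairTokenizer):
    """Sketch: remap -1 ids to a designated OOV token's id."""

    def __init__(self, vocabulary, merges, oov_token=None, **kwargs):
        super().__init__(vocabulary=vocabulary, merges=merges, **kwargs)
        # Resolve the OOV id up front; None preserves current behavior.
        self.oov_id = None if oov_token is None else vocabulary[oov_token]

    def tokenize(self, inputs):
        token_ids = super().tokenize(inputs)
        if self.oov_id is None:
            return token_ids  # no oov_token set: OOV stays -1

        def remap(ids):
            return tf.where(ids == -1, tf.cast(self.oov_id, ids.dtype), ids)

        # Batched inputs may come back as a ragged tensor.
        if isinstance(token_ids, tf.RaggedTensor):
            return tf.ragged.map_flat_values(remap, token_ids)
        return remap(token_ids)
```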