Allow Hugging Face tokenizer to pass arguments for add/skip special tokens
Abhishek8394 opened this issue · 1 comments
Abhishek8394 commented
Thank you for this wrapper!
I would like to propose the following changes to the API, and I am contributing the implementation too:
- Allow the Hugging Face tokenizer's Encode method to optionally take an add_special_tokens argument. Many models require these special tokens, and prepending them to the returned vector isn't optimal.
- Allow the Hugging Face tokenizer's Decode method to optionally take a skip_special_tokens argument. Again, this saves time when using the string for downstream tasks, instead of slicing returned strings or trimming input vectors.
These changes would be backwards compatible. Users can opt in by explicitly initializing an HFTokenizer object or by casting a Tokenizer* to HFTokenizer*, assuming it is indeed an HFTokenizer.
These changes will leave the Tokenizer interface untouched.
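To make the proposed shape concrete, here is a minimal, self-contained sketch. It is not the wrapper's actual code: the Tokenizer class below is a stand-in for the existing interface, the HFTokenizer is a toy that maps each byte to one id, and the special-token ids, the default flag values, and the helper EncodeWithSpecials are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for the generic interface; the proposal leaves it unchanged.
class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  virtual std::vector<int32_t> Encode(const std::string& text) = 0;
  virtual std::string Decode(const std::vector<int32_t>& ids) = 0;
};

// Hypothetical special-token ids, chosen outside the byte range.
constexpr int32_t kBos = 256;
constexpr int32_t kEos = 257;

// Toy HFTokenizer: one id per byte. A real one would call into the
// Hugging Face tokenizers library instead.
class HFTokenizer : public Tokenizer {
 public:
  // Existing single-argument methods keep working (defaults are assumed).
  std::vector<int32_t> Encode(const std::string& text) override {
    return Encode(text, /*add_special_tokens=*/false);
  }
  std::string Decode(const std::vector<int32_t>& ids) override {
    return Decode(ids, /*skip_special_tokens=*/false);
  }

  // Proposed overload: optionally wrap the ids in BOS/EOS.
  std::vector<int32_t> Encode(const std::string& text, bool add_special_tokens) {
    std::vector<int32_t> ids;
    if (add_special_tokens) ids.push_back(kBos);
    for (unsigned char c : text) ids.push_back(c);
    if (add_special_tokens) ids.push_back(kEos);
    return ids;
  }

  // Proposed overload: optionally drop special tokens from the output.
  std::string Decode(const std::vector<int32_t>& ids, bool skip_special_tokens) {
    std::string out;
    for (int32_t id : ids) {
      if (id == kBos || id == kEos) {
        if (!skip_special_tokens) out += (id == kBos ? "<s>" : "</s>");
      } else {
        out.push_back(static_cast<char>(id));
      }
    }
    return out;
  }
};

// Hypothetical helper: use the extra flag only when the pointer really is
// an HFTokenizer, otherwise fall back to the plain interface.
std::vector<int32_t> EncodeWithSpecials(Tokenizer* tok, const std::string& text) {
  if (auto* hf = dynamic_cast<HFTokenizer*>(tok)) {
    return hf->Encode(text, /*add_special_tokens=*/true);
  }
  return tok->Encode(text);
}

// Usage: callers holding a Tokenizer* are unaffected; opting in needs a cast.
void UsageExample() {
  HFTokenizer hf;
  Tokenizer* base = &hf;
  std::vector<int32_t> ids = EncodeWithSpecials(base, "hi");
  assert(hf.Decode(ids, /*skip_special_tokens=*/true) == "hi");
}
```

Using dynamic_cast rather than static_cast means a Tokenizer* that is not an HFTokenizer yields a null pointer instead of undefined behaviour, at the cost of requiring RTTI.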
DreamGenX commented
As far as I can see, the HFTokenizer declaration is not exposed in the includes.