mlc-ai/tokenizers-cpp

Allow huggingface tokenizer to pass arguments for add/skip special tokens

Abhishek8394 opened this issue · 1 comments

Thank you for this wrapper!
I would like to propose the following changes to the API, and I am contributing the implementation too:

  • Allow the huggingface tokenizer's Encode method to optionally take an add_special_tokens argument. Many models require these special tokens, and prepending them to the returned vector by hand isn't optimal.
  • Allow the huggingface tokenizer's Decode method to optionally take a skip_special_tokens argument. Again, this saves time when using the string for downstream tasks, since callers no longer need to slice the returned string or trim the input vector.

These changes would be backwards compatible. Users can opt in by explicitly initializing an HFTokenizer object, or by casting a Tokenizer* to HFTokenizer* when it is known to point to an HFTokenizer.

These changes will leave the Tokenizer interface untouched.
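To make the proposed shape concrete, here is a minimal, self-contained sketch. It does not use the real library. `Tokenizer` and `HFTokenizer` below are simplified stand-ins, the byte-level "tokenization" and the ids `kToyBos` / `kToyEos` are invented for illustration, and the choice to have the one-argument overloads behave as if the flag were `false` is an assumption, not something confirmed by the library. The helper `EncodeWithSpecials` is also hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for the public interface; the proposal leaves it untouched.
class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  virtual std::vector<int32_t> Encode(const std::string& text) = 0;
  virtual std::string Decode(const std::vector<int32_t>& ids) = 0;
};

// Invented special-token ids for this toy example only.
constexpr int32_t kToyBos = 256;
constexpr int32_t kToyEos = 257;

// Toy stand-in for the huggingface-backed tokenizer: one token per byte.
class HFTokenizer : public Tokenizer {
 public:
  // Existing interface methods keep working (assumed default: flags off).
  std::vector<int32_t> Encode(const std::string& text) override {
    return Encode(text, /*add_special_tokens=*/false);
  }
  std::string Decode(const std::vector<int32_t>& ids) override {
    return Decode(ids, /*skip_special_tokens=*/false);
  }

  // Proposed extra overload: optionally wrap the ids in special tokens.
  std::vector<int32_t> Encode(const std::string& text, bool add_special_tokens) {
    std::vector<int32_t> ids;
    if (add_special_tokens) ids.push_back(kToyBos);
    for (char c : text) ids.push_back(static_cast<unsigned char>(c));
    if (add_special_tokens) ids.push_back(kToyEos);
    return ids;
  }

  // Proposed extra overload: optionally drop special tokens from the output.
  std::string Decode(const std::vector<int32_t>& ids, bool skip_special_tokens) {
    std::string out;
    for (int32_t id : ids) {
      if (id == kToyBos) {
        if (!skip_special_tokens) out += "<s>";
      } else if (id == kToyEos) {
        if (!skip_special_tokens) out += "</s>";
      } else {
        out.push_back(static_cast<char>(id));
      }
    }
    return out;
  }
};

// Hypothetical caller-side helper: use the extended overload when the
// pointer really is an HFTokenizer, otherwise fall back to the base method.
std::vector<int32_t> EncodeWithSpecials(Tokenizer* tok, const std::string& text) {
  if (auto* hf = dynamic_cast<HFTokenizer*>(tok)) {
    return hf->Encode(text, /*add_special_tokens=*/true);
  }
  return tok->Encode(text);
}
```

Code that only holds a `Tokenizer*` keeps calling the one-argument methods as before. Only callers that know they have an `HFTokenizer` reach the extra overloads, which is why the change stays backwards compatible.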

As far as I can see, the HFTokenizer declaration is not exposed in the public include headers.