tensorflow/text

Add more flexibility to fast_wordpiece_tokenizer_model_builder pretokenization

abuelnasr0 opened this issue · 0 comments

There is only one parameter controlling pretokenization, and it applies to whitespace, punctuation, and Chinese characters all together. There are situations where you want to pretokenize on punctuation and whitespace only. I suggest adding two bool parameters, one for punctuation and one for Chinese characters.
Furthermore, a more general approach would be to add a parameter that takes an array of pairs, where each pair is a range of characters to pretokenize on.
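
To make the request concrete, here is a rough pure-Python sketch of the behavior I have in mind. It is not the tensorflow_text implementation and does not touch the model builder. The names `pretokenize`, `split_on_punctuation`, `split_on_chinese_chars`, and `extra_split_ranges` are hypothetical and only illustrate the two proposed bool parameters plus the array-of-pairs option. The punctuation and CJK checks follow the usual BERT-style basic tokenizer rules, which I am assuming is the intended behavior here.

```python
import unicodedata
from typing import List, Sequence, Tuple

# CJK code point ranges as used by BERT-style basic tokenizers (assumption:
# the builder's "chinese chars" handling covers the same blocks).
CJK_RANGES: Sequence[Tuple[int, int]] = (
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F), (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
)


def _in_ranges(ch: str, ranges: Sequence[Tuple[int, int]]) -> bool:
    """Return True if the code point of ch falls in any inclusive range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def _is_punctuation(ch: str) -> bool:
    """ASCII symbol blocks plus any Unicode 'P*' category character."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def pretokenize(
    text: str,
    split_on_punctuation: bool = True,      # hypothetical parameter
    split_on_chinese_chars: bool = True,    # hypothetical parameter
    extra_split_ranges: Sequence[Tuple[int, int]] = (),  # hypothetical
) -> List[str]:
    """Split on whitespace, and isolate characters selected by the flags."""
    tokens: List[str] = []
    current: List[str] = []

    def flush() -> None:
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch.isspace():
            flush()  # whitespace always ends the current token
        elif (
            (split_on_punctuation and _is_punctuation(ch))
            or (split_on_chinese_chars and _in_ranges(ch, CJK_RANGES))
            or _in_ranges(ch, extra_split_ranges)
        ):
            flush()
            tokens.append(ch)  # the matched character becomes its own token
        else:
            current.append(ch)
    flush()
    return tokens
```

With this sketch, `pretokenize("hello, 世界!", split_on_chinese_chars=False)` keeps the Chinese characters together while still splitting on punctuation and whitespace, which is the case I need. Passing `extra_split_ranges=[(0x2D, 0x2D)]` with `split_on_punctuation=False` would split only on the hyphen.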

[Edit] I was working on a task for keras_nlp when this problem emerged. It would be better to handle it here rather than in keras_nlp.