cbaziotis/ekphrasis

tokenizing '20th' to '2','0','th'

KavishBhatia opened this issue · 1 comment

How can I keep this as a single token instead of having it split apart? Where does this tokenization happen?

It happens in the default pipeline of the tokenizer here. You can pass a custom pipeline to the tokenizer; removing "EMOJI" from that pipeline fixes the problem.
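To illustrate why dropping one pipeline entry changes the result, here is a minimal standalone sketch. It does not use ekphrasis itself: `PATTERNS`, `DEFAULT_PIPELINE` and `build_tokenizer` are hypothetical names, and the regexes are toy stand-ins. It also assumes the split is caused by an emoji pattern that matches bare digits (for example as the base of keycap emoji), which I have not verified against the library's actual expressions.

```python
import re

# Toy stand-ins for named regex patterns; these are NOT the ekphrasis expressions.
PATTERNS = {
    # Assumption: the emoji pattern also matches single digits (keycap bases).
    "EMOJI": r"[0-9#*]|[\U0001F300-\U0001FAFF]",
    "NUMBER": r"\d+(?:st|nd|rd|th)?",
    "WORD": r"[A-Za-z]+",
}

# Order matters: earlier entries win when several patterns match at one position.
DEFAULT_PIPELINE = ["EMOJI", "NUMBER", "WORD"]


def build_tokenizer(pipeline):
    """Compile the named patterns into one alternation, in pipeline order."""
    regex = re.compile("|".join("(?:{})".format(PATTERNS[name]) for name in pipeline))
    return regex.findall


default_tokenize = build_tokenizer(DEFAULT_PIPELINE)
# Custom pipeline: same as the default, minus the "EMOJI" entry.
custom_tokenize = build_tokenizer([name for name in DEFAULT_PIPELINE if name != "EMOJI"])

print(default_tokenize("the 20th century"))  # ['the', '2', '0', 'th', 'century']
print(custom_tokenize("the 20th century"))   # ['the', '20th', 'century']
```

Because the patterns are tried in order, the digit-matching "EMOJI" entry consumes '2' and '0' one at a time before the number pattern gets a chance. With ekphrasis the idea should be the same: pass the tokenizer your own pipeline list without "EMOJI". The trade-off is that emoji are then no longer matched by a dedicated pattern.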