Leaving spaces at the beginning of next tokens?
speedcell4 opened this issue · 3 comments
When I use `UnicodeScripts` to split chars from different languages:
```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.UnicodeScripts()
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```
The output is `['@ ', 'this year', '12223', 'old isn', "'", 't ', 'これから ', '45 ', 'a bad', '-', 'thing', '.']`. The spaces are left at the end of the previous tokens, such as `'@ '`, `'t '`, `'これから '`, and `'45 '`.
In addition, if `Metaspace` follows:
```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Sequence([
    pre_tokenizers.UnicodeScripts(),
    pre_tokenizers.Metaspace(prepend_scheme='first'),
])
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```
this results in many unnecessary underscores (▁) in the final output: `['▁@', '▁', 'this', '▁year', '12223', 'old', '▁isn', "'", 't', '▁', 'これから', '▁', '45', '▁', 'a', '▁bad', '-', 'thing', '.']`.
My expected output is `['@', ' this year', '12223', 'old isn', "'", 't', ' これから', ' 45', ' a bad', '-', 'thing', '.']` (with the spaces at the beginning of the next tokens), which would then produce `['▁@', '▁this', '▁year', '12223', 'old', '▁isn', "'", 't', '▁これから', '▁45', '▁a', '▁bad', '-', 'thing', '.']` in the end.
So, is it possible to leave spaces at the beginning of the next tokens instead of at the ends of the previous tokens?
The same issue occurs with `pre_tokenizers.Digits()` and `pre_tokenizers.Punctuation()`.
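For example, a minimal sketch of the same behavior (chaining `Digits` and `Punctuation` here is just one illustrative setup, not the exact pipeline above):

```python
from tokenizers import pre_tokenizers

# Digits splits off digit runs and Punctuation splits off punctuation marks;
# in both cases the surrounding spaces stay glued to the neighbouring pieces
# rather than moving to the beginning of the next token.
pre = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(),
    pre_tokenizers.Punctuation(),
])
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```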
I found I don't really need `UnicodeScripts`; just `Digits` and `Punctuation` are fine.
And I got it solved with `pre_tokenizers.Split(Regex(r' *(([\p{P}\p{S}])|(\d+))'), 'isolated')`, as in the sketch below.
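A self-contained version of the workaround (the regex is the one above; the comment on why it works is my reading of it):

```python
from tokenizers import Regex, pre_tokenizers

# The optional leading ' *' is consumed as part of each isolated match, so
# spaces travel with the punctuation/symbol or digit run that follows them
# instead of dangling at the end of the previous token.
pre = pre_tokenizers.Split(Regex(r' *(([\p{P}\p{S}])|(\d+))'), 'isolated')
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```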
Thanks anyway.
Awesome, and sorry! I was thinking that `Split` was what you needed for sure (`Metaspace` has that embedded) 🤗
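For anyone landing here later, a hedged sketch of what "embedded" means, assuming a recent `tokenizers` version where `Metaspace` takes `prepend_scheme`:

```python
from tokenizers import pre_tokenizers

# Metaspace on its own already splits on whitespace and attaches the ▁
# marker to the start of the following token, so no separate whitespace
# Split step is needed before it.
pre = pre_tokenizers.Metaspace(prepend_scheme='first')
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```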