Leaving spaces at the beginning of next tokens?
speedcell4 opened this issue · 3 comments
When I use `UnicodeScripts` to split chars from different languages:
```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.UnicodeScripts()
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```
The output is `['@ ', 'this year', '12223', 'old isn', "'", 't ', 'これから ', '45 ', 'a bad', '-', 'thing', '.']`. The spaces are left at the end of the previous tokens, such as `'@ '`, `'t '`, `'これから '`, and `'45 '`.
In addition, if `Metaspace` follows:
```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Sequence([
    pre_tokenizers.UnicodeScripts(),
    pre_tokenizers.Metaspace(prepend_scheme='first'),
])
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```
this results in many unnecessary underscores (▁) in the final output: `['▁@', '▁', 'this', '▁year', '12223', 'old', '▁isn', "'", 't', '▁', 'これから', '▁', '45', '▁', 'a', '▁bad', '-', 'thing', '.']`.
My expected output is `['@', ' this year', '12223', 'old isn', "'", 't', ' これから', ' 45', ' a bad', '-', 'thing', '.']` (with the spaces at the beginning of the next tokens), which would then produce `['▁@', '▁this', '▁year', '12223', 'old', '▁isn', "'", 't', '▁これから', '▁45', '▁a', '▁bad', '-', 'thing', '.']` in the end.
So, is it possible to leave spaces at the beginning of the next tokens instead of at the ends of the previous tokens?
The same issue occurs with `pre_tokenizers.Digits()` and `pre_tokenizers.Punctuation()`.
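For example, a minimal sketch of the same behavior (chaining `Digits` and `Punctuation` here is just one illustrative setup, not the exact pipeline above):

```python
from tokenizers import pre_tokenizers

# Digits splits off digit runs and Punctuation splits off punctuation marks;
# in both cases the surrounding spaces stay glued to the neighbouring pieces
# rather than moving to the beginning of the next token.
pre = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(),
    pre_tokenizers.Punctuation(),
])
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```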
I found I don't really need `UnicodeScripts`; just `Digits` and `Punctuation` are fine.
And I got it solved with `pre_tokenizers.Split(Regex(r' *(([\p{P}\p{S}])|(\d+))'), 'isolated')`, as in the sketch below.
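A self-contained version of the workaround (the regex is the one above; the comment on why it works is my reading of it):

```python
from tokenizers import Regex, pre_tokenizers

# The optional leading ' *' is consumed as part of each isolated match, so
# spaces travel with the punctuation/symbol or digit run that follows them
# instead of dangling at the end of the previous token.
pre = pre_tokenizers.Split(Regex(r' *(([\p{P}\p{S}])|(\d+))'), 'isolated')
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```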
Thanks anyway.
Awesome, and sorry! I was thinking that `Split` was what you needed for sure (`Metaspace` has that embedded) 🤗
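For anyone landing here later, a hedged sketch of what "embedded" means, assuming a recent `tokenizers` version where `Metaspace` takes `prepend_scheme`:

```python
from tokenizers import pre_tokenizers

# Metaspace on its own already splits on whitespace and attaches the ▁
# marker to the start of the following token, so no separate whitespace
# Split step is needed before it.
pre = pre_tokenizers.Metaspace(prepend_scheme='first')
text = "@ this year12223old isn't これから 45 a bad-thing."
print([token for token, _ in pre.pre_tokenize_str(text)])
```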