Unexpected outputs with regex_split.
markblee opened this issue · 1 comment
markblee commented
I want to achieve something simple: split a string at places where we see a space, followed by non-space. It seems to fail when there are newlines:
> pat = "\s(\S+)"
# Works with tabs.
> regex_split("\ta\ttest", pat, keep_delim_regex_pattern=".*")
<tf.RaggedTensor [[b'\ta', b'\ttest']]>
# Fails with newline.
> regex_split("\ta\ntest", pat, keep_delim_regex_pattern=".*")
<tf.RaggedTensor [[b'\ta']]>
For reference, I see this with Python's re:
> re.findall(pat, "\ta\ttest")
['\t', 'a', '\t', 'test']
> re.findall(pat, "\ta\ntest")
['\t', 'a', '\n', 'test']
This kind of unexpected "cutoff" of the input also seems to happen with null characters (\0). Is this intended behavior?
markblee commented
Ah, looks like I mistakenly assumed that the . metacharacter matches newlines. Replacing .* with (?s).* as the keep_delim_regex_pattern appears to work.
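For anyone landing here later, the difference can be reproduced in isolation. This is a minimal sketch using Python's re module rather than the regex engine behind regex_split, so it only illustrates the dot-versus-newline behavior. It assumes a delimiter is kept only when the keep pattern matches it in full, which is consistent with the outputs above but is not confirmed here. The helper name fully_matches is made up for illustration.

```python
import re

# The delimiters that "\s(\S+)" picks out of the two inputs from the report.
delimiters = ["\ta", "\ttest", "\ntest"]


def fully_matches(pattern, text):
    """Return True if the pattern matches the entire string (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# By default, "." matches any character except a newline.
default_dot = [fully_matches(r".*", d) for d in delimiters]

# With the (?s) flag ("dot matches all"), "." also matches a newline.
dotall = [fully_matches(r"(?s).*", d) for d in delimiters]

print(default_dot)  # [True, True, False]
print(dotall)       # [True, True, True]
```

Under that assumption, the "\ntest" delimiter fails the default .* check and is dropped, which would explain why only b'\ta' showed up in the newline case.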