tensorflow/text

Unexpected outputs with regex_split.

markblee opened this issue · 1 comment

I want to achieve something simple: split a string wherever a whitespace character is followed by one or more non-whitespace characters. It seems to fail when the input contains a newline:

> pat = "\s(\S+)"

# Works with tabs.
> regex_split("\ta\ttest", pat, keep_delim_regex_pattern=".*")
<tf.RaggedTensor [[b'\ta', b'\ttest']]>

# Fails with newline.
> regex_split("\ta\ntest", pat, keep_delim_regex_pattern=".*")
<tf.RaggedTensor [[b'\ta']]>

For reference, I see this with Python's re:

> re.findall(pat, "\ta\ttest")
['\t', 'a', '\t', 'test']

> re.findall(pat, "\ta\ntest")
['\t', 'a', '\n', 'test']

This kind of unexpected "cutoff" of the input also seems to happen with null characters \0. Is this intended behavior?

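Ah, looks like I mistakenly assumed that . matches newlines. It does not by default, so keep_delim_regex_pattern=".*" did not match the delimiter containing "\n". Replacing it with (?s).* appears to work.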
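
For anyone landing here later, the difference can be reproduced with Python's re module alone. This is only a sketch: kept_delimiters is a hypothetical helper, not part of tensorflow_text, and it assumes a delimiter is kept only when the keep pattern matches it in full. That assumption is consistent with the outputs above but is not confirmed by the library source here. Python's re is also a different engine from the one used by regex_split, although both treat . as excluding newlines unless the s flag is set.

```python
import re

def kept_delimiters(text, delim_pattern, keep_pattern):
    """Hypothetical model of the behavior seen above, not the library code.

    Finds every delimiter match and keeps only those that the keep
    pattern matches in full.
    """
    delimiters = [m.group(0) for m in re.finditer(delim_pattern, text)]
    return [d for d in delimiters if re.fullmatch(keep_pattern, d)]

pat = r"\s(\S+)"
text = "\ta\ntest"

# "." stops at the newline, so the "\ntest" delimiter is not fully matched.
without_dotall = kept_delimiters(text, pat, r".*")

# "(?s)" turns on dot-all mode, so "." also matches "\n".
with_dotall = kept_delimiters(text, pat, r"(?s).*")

print(without_dotall)  # ['\ta']
print(with_dotall)     # ['\ta', '\ntest']
```

The first result mirrors the truncated RaggedTensor from the original report, and the second mirrors the output after switching to (?s).*.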