Feature request: Modify `text.regex_split_with_offsets()` behavior to be in line with `tf.strings.length()`
briango28 commented
`text.regex_split_with_offsets()` currently returns `begin` and `end` as `tf.int64` tensors that count indices in bytes. `tf.strings.length()`, on the other hand, returns a `tf.int32` tensor which counts lengths in either bytes or UTF-8 characters according to the value of the parameter `unit`.
So this is actually two separate requests:

1. Change the return type of `text.regex_split_with_offsets()` to `tf.int32`, removing the need for a cast when comparing with `tf.strings.length()`. I doubt there will be a use case for strings longer than `INT32_MAX` in the foreseeable future.
2. Add a parameter `unit: Literal["BYTE", "UTF8_CHAR"] = "BYTE"` matching the behavior of `tf.strings.length()` and `tf.strings.substr()`. Seeing that the regular expressions are already being interpreted as UTF-8, I think it would make sense to add a layer of abstraction to facilitate slicing by UTF-8 character index.
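To illustrate the discrepancy the `unit` parameter would resolve, here is a pure-Python sketch (no TensorFlow required) showing how the same regex split yields different offsets depending on whether positions are counted in characters or in bytes:

```python
import re

text = "héllo wörld"  # 'é' and 'ö' each occupy 2 bytes in UTF-8

# Character offsets, as a hypothetical unit="UTF8_CHAR" would report them
char_spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

# Byte offsets, as regex_split_with_offsets() reports them today
byte_spans = [(m.start(), m.end())
              for m in re.finditer(rb"\S+", text.encode("utf-8"))]

print(char_spans)  # [(0, 5), (6, 11)]
print(byte_spans)  # [(0, 6), (7, 13)]
```

The two-byte characters shift every subsequent byte offset, so byte-based `begin`/`end` cannot be compared directly against a character-based `tf.strings.length(..., unit="UTF8_CHAR")`.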
briango28 commented
Follow-up: having converted `begin` & `end` indices from `BYTE` to `UTF8_CHAR` with

```python
offsets = tf.strings.unicode_decode_with_offsets(txt, 'UTF-8')[1]
begin = tf.map_fn(lambda indices: tf.where(tf.expand_dims(indices, 1) == offsets)[:, 1], begin)
```

where `tf.strings.unicode_decode_with_offsets()` returns offsets with type `tf.int64`, I'm not so sure about no. 1 anymore :/
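The conversion above looks up each byte offset in the per-character byte-offset table that `unicode_decode_with_offsets()` produces. A minimal pure-Python sketch of the same idea (the function name `byte_to_char_indices` is hypothetical, for illustration only):

```python
def byte_to_char_indices(text: str, byte_indices: list[int]) -> list[int]:
    """Map byte offsets into `text` (UTF-8) to character offsets."""
    # Build the byte offset at which each character starts,
    # analogous to the offsets from unicode_decode_with_offsets().
    char_byte_offsets = []
    pos = 0
    for ch in text:
        char_byte_offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    # Each input byte offset must land on a character boundary.
    return [char_byte_offsets.index(b) for b in byte_indices]

s = "héllo wörld"
# byte 0 -> char 0 ('h'); byte 7 -> char 6 ('w', shifted by the 2-byte 'é')
print(byte_to_char_indices(s, [0, 7]))  # [0, 6]
```

The linear `index()` lookup is quadratic in the worst case; a production version would use a binary search or a dict, much as the TensorFlow snippet relies on a broadcast comparison.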