Feature request: Modify `text.regex_split_with_offsets()` behavior to be in line with `tf.strings.length()`
briango28 commented
`text.regex_split_with_offsets()` currently returns `begin` and `end` as `tf.int64` tensors that count indices in bytes. `tf.strings.length()`, on the other hand, returns a `tf.int32` tensor which counts lengths in either bytes or UTF-8 characters according to the value of the parameter `unit`.
So this is actually two separate requests:

1. Change the return type of `text.regex_split_with_offsets()` to `tf.int32`, removing the need for a cast when comparing with `tf.strings.length()`. I doubt there will be a use case for strings longer than `INT32_MAX` in the foreseeable future.
2. Add a parameter `unit: Literal["BYTE", "UTF8_CHAR"] = "BYTE"` matching the behavior of `tf.strings.length()` and `tf.strings.substr()`. Seeing that the regular expressions are already being interpreted as UTF-8, I think it would make sense to add a layer of abstraction to facilitate slicing by UTF-8 character index.
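To illustrate the discrepancy the `unit` parameter would resolve, here is a pure-Python sketch (no TensorFlow required) showing how the same regex split yields different offsets depending on whether positions are counted in characters or in bytes:

```python
import re

text = "héllo wörld"  # 'é' and 'ö' each occupy 2 bytes in UTF-8

# Character offsets, as a hypothetical unit="UTF8_CHAR" would report them
char_spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

# Byte offsets, as regex_split_with_offsets() reports them today
byte_spans = [(m.start(), m.end())
              for m in re.finditer(rb"\S+", text.encode("utf-8"))]

print(char_spans)  # [(0, 5), (6, 11)]
print(byte_spans)  # [(0, 6), (7, 13)]
```

The two-byte characters shift every subsequent byte offset, so byte-based `begin`/`end` cannot be compared directly against a character-based `tf.strings.length(..., unit="UTF8_CHAR")`.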
briango28 commented
Follow-up: having converted `begin` & `end` indices from `BYTE` to `UTF8_CHAR` with

```python
offsets = tf.strings.unicode_decode_with_offsets(txt, 'UTF-8')[1]
begin = tf.map_fn(lambda indices: tf.where(tf.expand_dims(indices, 1) == offsets)[:, 1], begin)
```

where `tf.strings.unicode_decode_with_offsets()` returns offsets with type `tf.int64`, I'm not so sure about no. 1 anymore :/
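The conversion above looks up each byte offset in the per-character byte-offset table that `unicode_decode_with_offsets()` produces. A minimal pure-Python sketch of the same idea (the function name `byte_to_char_indices` is hypothetical, for illustration only):

```python
def byte_to_char_indices(text: str, byte_indices: list[int]) -> list[int]:
    """Map byte offsets into `text` (UTF-8) to character offsets."""
    # Build the byte offset at which each character starts,
    # analogous to the offsets from unicode_decode_with_offsets().
    char_byte_offsets = []
    pos = 0
    for ch in text:
        char_byte_offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    # Each input byte offset must land on a character boundary.
    return [char_byte_offsets.index(b) for b in byte_indices]

s = "héllo wörld"
# byte 0 -> char 0 ('h'); byte 7 -> char 6 ('w', shifted by the 2-byte 'é')
print(byte_to_char_indices(s, [0, 7]))  # [0, 6]
```

The linear `index()` lookup is quadratic in the worst case; a production version would use a binary search or a dict, much as the TensorFlow snippet relies on a broadcast comparison.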