What is the purpose of the regex substitution in preprocessing?
marksverdhei opened this issue · 1 comments
marksverdhei commented
Hi,
I'm working on fine-tuning the model on a custom dataset, and therefore wanted to replicate the data preprocessing.
As the README suggests, I looked at the preprocessing function and noticed this regex substitution.
Lines 54 to 58 in 7bf0653
I tried it out in the Python shell to double-check:
>>> import tensorflow as tf
>>> s = "abc def 'ghi' jkl 'mno' 'pqr'"
>>> s_lower = tf.strings.lower(s)
>>> s_lower
<tf.Tensor: shape=(), dtype=string, numpy=b"abc def 'ghi' jkl 'mno' 'pqr'">
>>> tf.strings.regex_replace(s_lower, "'(.*)'", r"\1")
<tf.Tensor: shape=(), dtype=string, numpy=b"abc def ghi' jkl 'mno' 'pqr">
So it seems that it removes the first and last quotes of an input string, but not the ones in the middle.
Is this intentional?
The docstring says "remove quotes", so I wasn't sure whether that means all quotes or only the first and last, or whether there is something I'm missing.
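For comparison, here is a small sketch of what I think is going on, using Python's built-in re module instead of TensorFlow so it runs without any dependencies. The greedy group (.*) spans from the first quote to the last one, which would explain why only the outer pair disappears. The non-greedy variant and the plain replace are my own guesses at what "remove quotes" might mean, not something taken from the repository. I'm assuming the same patterns behave the same way in tf.strings.regex_replace, which uses RE2 syntax, but I have only checked the greedy one there.

```python
import re

s = "abc def 'ghi' jkl 'mno' 'pqr'"

# Greedy: (.*) runs from the first quote to the last quote,
# so only the outermost pair of quotes is dropped.
greedy = re.sub(r"'(.*)'", r"\1", s)

# Non-greedy (hypothetical alternative): (.*?) stops at the nearest
# closing quote, so every quoted span is unwrapped individually.
lazy = re.sub(r"'(.*?)'", r"\1", s)

# Another hypothetical alternative: drop every quote character,
# whether or not it is part of a matching pair.
stripped = s.replace("'", "")

print(greedy)    # abc def ghi' jkl 'mno' 'pqr
print(lazy)      # abc def ghi jkl mno pqr
print(stripped)  # abc def ghi jkl mno pqr
```

The last two give the same result on this string, but they would differ on input with an unpaired quote such as an apostrophe.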
danyaljj commented
All these were inherited [semi-blindly] from this example. See the definition of the trivia_preprocessor function.