allenai/unifiedqa

What is the purpose for the regex substitution in preprocessing?

marksverdhei opened this issue · 1 comment

Hi,

I'm working on fine-tuning the model for a custom dataset, and therefore wanted to replicate the data preprocessing.
As the README suggests, I looked at the preprocessing function and noticed this regex substitution.

unifiedqa/tasks.py

Lines 54 to 58 in 7bf0653

def normalize_text(text):
    """Lowercase and remove quotes from a TensorFlow string."""
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, "'(.*)'", r"\1")
    return text

I tried it out in the Python shell to double-check:

>>> import tensorflow as tf
>>> s = "abc def 'ghi' jkl 'mno' 'pqr'"
>>> s_lower = tf.strings.lower(s)
>>> s_lower
<tf.Tensor: shape=(), dtype=string, numpy=b"abc def 'ghi' jkl 'mno' 'pqr'">
>>> tf.strings.regex_replace(s_lower, "'(.*)'", r"\1")
<tf.Tensor: shape=(), dtype=string, numpy=b"abc def ghi' jkl 'mno' 'pqr">

So it seems that it removes the first and last quotes of an input string, but not the ones in between.
Is this intentional?
The docstring says "remove quotes", so I wasn't sure whether that meant all quotes or only the first and last, or if there is something I'm missing.

All these were inherited [semi-blindly] from this example. See the definition of the trivia_preprocessor function.
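
For anyone replicating the preprocessing outside TensorFlow, the behavior reported above comes from the greedy `.*`, which spans from the first quote to the last one. The sketch below reproduces it with Python's standard `re` module, on the assumption that `re` and the regex engine behind `tf.strings.regex_replace` agree on these simple patterns. The function names are hypothetical and not part of the repository, and the non-greedy variant is shown only for contrast.

```python
import re


def strip_outer_quotes(text: str) -> str:
    """Lowercase, then apply the same greedy pattern as normalize_text.

    Hypothetical helper: '(.*)' matches from the first single quote to the
    last one, so only that outermost pair is removed.
    """
    return re.sub(r"'(.*)'", r"\1", text.lower())


def strip_quote_pairs(text: str) -> str:
    """Lowercase, then remove every matched pair of single quotes.

    Hypothetical alternative using the non-greedy '(.*?)'. This is NOT what
    the repository's preprocessing does.
    """
    return re.sub(r"'(.*?)'", r"\1", text.lower())


sample = "abc def 'ghi' jkl 'mno' 'pqr'"
print(strip_outer_quotes(sample))  # abc def ghi' jkl 'mno' 'pqr
print(strip_quote_pairs(sample))   # abc def ghi jkl mno pqr
```

If the goal is to match the inputs the released model was trained on, keeping the greedy behavior is probably the safer choice, even though it does not remove every quote.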