google-research/text-to-text-transfer-transformer

SuperGlue not working using seqio.get_dataset - data type ' *' not understood

Arka161 opened this issue · 0 comments

Describe the bug
I am unable to load the SuperGLUE dataset using seqio.get_dataset. My task registration code is based on the code given in this example.

First, this line appears to be buggy:

seqio.experimental.add_task_with_sentinels("super_glue_%s_v102" % b.name, num_sentinels=1)

Why do we add the task with sentinels when the task has already been added just above? The code doesn't seem to handle this case at all.

It raises the following error:

ValueError: Attempting to register duplicate provider: super_glue_boolq_v102_1_sentinel
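
One way to avoid a duplicate registration is to register a name only when it is not already present. The sketch below shows that pattern with a plain dict standing in for the registry. `register_once`, `fake_registry`, and `add_boolq_sentinel_task` are hypothetical names used for illustration; none of them are part of seqio.

```python
from typing import Callable, Iterable


def register_once(
    name: str,
    registered_names: Callable[[], Iterable[str]],
    register: Callable[[], None],
) -> bool:
    """Call `register` only if `name` is not already registered.

    Returns True if registration ran, False if it was skipped.
    """
    if name in set(registered_names()):
        return False
    register()
    return True


# Stand-in registry for illustration only; this is not the seqio registry.
fake_registry = {}


def add_boolq_sentinel_task() -> None:
    """Hypothetical registration callback that records the task name."""
    fake_registry["super_glue_boolq_v102_1_sentinel"] = object()


first = register_once(
    "super_glue_boolq_v102_1_sentinel", fake_registry.keys, add_boolq_sentinel_task
)
second = register_once(
    "super_glue_boolq_v102_1_sentinel", fake_registry.keys, add_boolq_sentinel_task
)
print(first, second)  # prints: True False
```

With seqio, `registered_names` could presumably be `seqio.TaskRegistry.names` and `register` a small wrapper around the `add_task_with_sentinels` call, but I have not verified this against the installed version.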

I tried to work around this by skipping the add_task_with_sentinels call. However, when I then try to get the dataset with seqio.get_dataset(), I get the following error:

    File "/usr/local/lib/python3.8/dist-packages/t5/data/preprocessors.py", line 1321, in map_fn  *
        inputs = [
    File "<__array_function__ internals>", line 5, in result_type
        

    TypeError: data type ' *' not understood

The error originates in the following function (map_fn, defined inside wsc_simple in t5/data/preprocessors.py according to the stack trace):

  def map_fn(x):
    """Function to be called for every example in dataset."""
    inputs = [
        label,
        tf.strings.regex_replace(
            _wsc_inputs(x), r' X ', ' *' + x['span2_text'] + '* '),
    ]
    referent = x['span1_text']
    return {
        'inputs': tf.strings.join(inputs, separator=' '),
        # The reshape is necessary as otherwise the tensor has unknown rank.
        'targets': tf.reshape(referent, shape=[]),
        'label': x.get('label', 0),
        'idx': x['idx'],
    }

Is there a bug in this function?
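
The message itself looks like it comes from NumPy rather than TensorFlow: the trace ends in `result_type`, and NumPy treats a bare string passed to `result_type` as a dtype name. Below is a minimal sketch, assuming NumPy is installed, that reproduces the same kind of message without seqio or t5. The helper name `dtype_error_for` is hypothetical. My guess (unconfirmed) is that `' *' + x['span2_text']` is being dispatched to a NumPy-style addition in this environment instead of TensorFlow's string concatenation.

```python
import numpy as np


def dtype_error_for(value: str) -> str:
    """Return the TypeError message NumPy raises for an invalid dtype string.

    Returns an empty string if NumPy accepts `value` as a dtype.
    """
    try:
        np.result_type(value)
    except TypeError as err:
        return str(err)
    return ""


# ' *' is the string literal from map_fn; NumPy tries to parse it as a dtype.
message = dtype_error_for(" *")
print(message)
```

If that guess is right, building the replacement string with `tf.strings.join` instead of `+` might sidestep the problem, though I have not tested this.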

To Reproduce
Steps to reproduce the behavior:

The problem is simple to reproduce: register the tasks using the SuperGLUE v1.0.2 template from the example code in tasks.py, then call seqio.get_dataset(..) on one of the registered tasks. The error above is raised.

Expected behavior
No error should be raised, and seqio.get_dataset should return an iterable dataset.

Desktop (please complete the following information):
Using the newest version of the code from this repository, along with PyTorch 1.10.0.

Additional context
The full stack trace is given below:

    dataset = seqio.get_dataset(
  File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1580, in get_dataset
    ds = mixture_or_task.get_dataset(
  File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1372, in get_dataset
    datasets = [
  File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1373, in <listcomp>
    task.get_dataset(  # pylint:disable=g-complex-comprehension
  File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1121, in get_dataset
    ds = self.preprocess_precache(ds, seed=seed)
  File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 942, in preprocess_precache
    return self._preprocess_dataset(
  File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 878, in _preprocess_dataset
    dataset = prep_fn(dataset, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/t5/data/preprocessors.py", line 1338, in wsc_simple
    return dataset.map(map_fn, num_parallel_calls=AUTOTUNE)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 2006, in map
    return ParallelMapDataset(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 5501, in __init__
    self._map_func = StructuredFunctionWrapper(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 4533, in __init__
    self._function = fn_factory()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3244, in get_concrete_function
    graph_function = self._get_concrete_function_garbage_collected(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3210, in _get_concrete_function_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3557, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3392, in _create_graph_function
    func_graph_module.func_graph_from_py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/func_graph.py", line 1143, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 4510, in wrapped_fn
    ret = wrapper_helper(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 4440, in wrapper_helper
    ret = autograph.tf_convert(self._func, ag_ctx)(*nested_args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 699, in wrapper
    raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:

    File "/usr/local/lib/python3.8/dist-packages/t5/data/preprocessors.py", line 1321, in map_fn  *
        inputs = [
    File "<__array_function__ internals>", line 5, in result_type
        

    TypeError: data type ' *' not understood

This appears to be a bug in the repository. I would appreciate any answers or workarounds. Thank you.