SuperGlue not working using seqio.get_dataset - data type ' *' not understood
Arka161 opened this issue · 0 comments
Describe the bug
Unable to get the SuperGlue dataset using seqio.get_dataset. I have used registry code based on the code given in this example.
Firstly, there's a bug in this line:
seqio.experimental.add_task_with_sentinels("super_glue_%s_v102" % b.name, num_sentinels=1)
Why do we try to add tasks with sentinels when we add the task above? The code doesn't seem to handle this at all.
We get the error:
ValueError: Attempting to register duplicate provider: super_glue_boolq_v102_1_sentinel
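For illustration, a guard of the following shape would avoid registering the same name twice. This is only a sketch: `register_once` and the stand-in `registry` set are hypothetical names, not part of seqio. With seqio, the existing names would presumably come from `seqio.TaskRegistry.names()`, and the sentinel name format is inferred from the error message above.

```python
from typing import Callable, Iterable


def register_once(existing_names: Iterable[str],
                  name: str,
                  register: Callable[[], None]) -> bool:
    """Calls `register` only if `name` is not already registered.

    Returns True if registration happened, False if it was skipped.
    """
    if name in set(existing_names):
        return False
    register()
    return True


# Stand-in registry, used only so this sketch runs without seqio.
registry = {"super_glue_boolq_v102_1_sentinel"}

# Already present, so the registration callback is skipped.
added_dup = register_once(
    registry,
    "super_glue_boolq_v102_1_sentinel",
    lambda: registry.add("super_glue_boolq_v102_1_sentinel"),
)

# Not present yet, so the registration callback runs.
added_new = register_once(
    registry,
    "super_glue_cb_v102_1_sentinel",
    lambda: registry.add("super_glue_cb_v102_1_sentinel"),
)

# Assumed seqio usage (not executed here):
#   sentinel_name = "super_glue_%s_v102_1_sentinel" % b.name
#   register_once(
#       seqio.TaskRegistry.names(), sentinel_name,
#       lambda: seqio.experimental.add_task_with_sentinels(
#           "super_glue_%s_v102" % b.name, num_sentinels=1))
```

The guard only hides the symptom. It does not explain why the sentinel task was already registered in the first place.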
I tried to work around this by skipping the add_task_with_sentinels API call. However, when I then try to get the dataset using the seqio.get_dataset() call, I get the following error:
File "/usr/local/lib/python3.8/dist-packages/t5/data/preprocessors.py", line 1321, in map_fn *
inputs = [
File "<__array_function__ internals>", line 5, in result_type
TypeError: data type ' *' not understood
The error originates in the following function:
def map_fn(x):
  """Function to be called for every example in dataset."""
  inputs = [
      label,
      tf.strings.regex_replace(
          _wsc_inputs(x), r' X ', ' *' + x['span2_text'] + '* '),
  ]
  referent = x['span1_text']
  return {
      'inputs': tf.strings.join(inputs, separator=' '),
      # The reshape is necessary as otherwise the tensor has unknown rank.
      'targets': tf.reshape(referent, shape=[]),
      'label': x.get('label', 0),
      'idx': x['idx'],
  }
Is there a bug in this function?
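For reference, the last frame of the traceback is NumPy's result_type, which treats string arguments as dtype specifiers. The sketch below, assuming only that NumPy is installed, shows that passing the literal ' *' to it fails with a TypeError outside TensorFlow as well. The helper name `dtype_error_message` is hypothetical, and this does not explain why TensorFlow ends up calling result_type with that string.

```python
import numpy as np


def dtype_error_message(value):
    """Returns the TypeError message np.result_type raises for `value`, or None."""
    try:
        np.result_type(value)
    except TypeError as exc:
        return str(exc)
    return None


# A valid dtype string is accepted and resolved to a dtype.
valid = np.result_type("float32")

# The string literal from map_fn is not a valid dtype specifier.
message = dtype_error_message(" *")
print(message)
```

This suggests the string ' *' reaches NumPy's dtype promotion logic instead of being converted to a string tensor first.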
To Reproduce
Steps to reproduce the behavior:
It is very simple to reproduce. If you use the template from the example code in tasks.py for SuperGlue v1.0.2 and call seqio.get_dataset(..) on the registered tasks, the error occurs.
Expected behavior
No error should be raised, and seqio.get_dataset should return an iterable dataset.
Desktop (please complete the following information):
Using the newest version of the code from this repository, along with PyTorch 1.10.0.
Additional context
The full stack trace is given below:
dataset = seqio.get_dataset(
File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1580, in get_dataset
ds = mixture_or_task.get_dataset(
File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1372, in get_dataset
datasets = [
File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1373, in <listcomp>
task.get_dataset( # pylint:disable=g-complex-comprehension
File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 1121, in get_dataset
ds = self.preprocess_precache(ds, seed=seed)
File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 942, in preprocess_precache
return self._preprocess_dataset(
File "/usr/local/lib/python3.8/dist-packages/seqio/dataset_providers.py", line 878, in _preprocess_dataset
dataset = prep_fn(dataset, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/t5/data/preprocessors.py", line 1338, in wsc_simple
return dataset.map(map_fn, num_parallel_calls=AUTOTUNE)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 2006, in map
return ParallelMapDataset(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 5501, in __init__
self._map_func = StructuredFunctionWrapper(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 4533, in __init__
self._function = fn_factory()
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3244, in get_concrete_function
graph_function = self._get_concrete_function_garbage_collected(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3210, in _get_concrete_function_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3557, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 3392, in _create_graph_function
func_graph_module.func_graph_from_py_func(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/func_graph.py", line 1143, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 4510, in wrapped_fn
ret = wrapper_helper(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 4440, in wrapper_helper
ret = autograph.tf_convert(self._func, ag_ctx)(*nested_args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 699, in wrapper
raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:
File "/usr/local/lib/python3.8/dist-packages/t5/data/preprocessors.py", line 1321, in map_fn *
inputs = [
File "<__array_function__ internals>", line 5, in result_type
TypeError: data type ' *' not understood
This appears to be a bug in the repo. I would appreciate any answers or workarounds. Thank you.