Using ArrowDataset with tf.data.Dataset.interleave causes crashes

Question

sebp opened this issue a year ago · 0 comments

Description

I want to use ArrowDataset to load record batches directly from a string Tensor inside tf.data.Dataset.interleave, but the program aborts with SIGABRT instead.

Actual result

2023-04-28 11:05:47.396681: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-28 11:05:47.960406: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-04-28 11:05:48.626703: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
2023-04-28 11:05:48.734126: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [3]
         [[{{node Placeholder/_0}}]]
2023-04-28 11:05:48.734300: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [3]
         [[{{node Placeholder/_0}}]]
2023-04-28 11:05:48.745237: E tensorflow/core/framework/dataset.cc:617] UNIMPLEMENTED: Cannot compute input sources for dataset of type IO>ArrowSerializedDataset, because the dataset does not implement `InputDatasets`.
2023-04-28 11:05:48.745307: E tensorflow/core/framework/dataset.cc:621] UNIMPLEMENTED: Cannot merge options for dataset of type IO>ArrowSerializedDataset, because the dataset does not implement `InputDatasets`.
external/arrow/cpp/src/arrow/array/data.cc:95:  Check failed: (off) <= (length) Slice offset greater than array length
fish: Job 1, 'python arr.py' terminated by signal SIGABRT (Abort)

Expected result

The same output as when passing a list of record batches directly to ArrowDataset.from_record_batches.

Code to reproduce

import numpy as np
import pyarrow as pa
import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset

# create arrow data
data = [
    pa.array([np.random.randn(128).astype(np.float32) for _ in range(3)]),
    pa.array(['foo', 'bar', 'baz']),
    pa.array([True, False, True]),
]
# create arrow batch
batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])

# write data to 3 files
data_files = ['dataset0.arrow', 'dataset1.arrow', 'dataset2.arrow']
for data_file in data_files:
    with open(data_file, 'wb') as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            for i in range(10):
                writer.write_batch(batch)

# tensorflow types and shape of data stored in record batch
output_types = (tf.float32, tf.string, tf.bool,)
output_shapes = (tf.TensorShape([128]), tf.TensorShape([]), tf.TensorShape([]),)

# create dataset
ds = tf.data.Dataset.from_tensor_slices(
    data_files
).map(
    tf.io.read_file
).interleave(
    lambda x: ArrowDataset(x, columns=(0, 1, 2,), output_types=output_types, output_shapes=output_shapes),
    cycle_length=1,
)

Versions

Ubuntu 22.04.1 LTS (Jammy Jellyfish)
Python 3.8.13
packages:

Package                      Version
---------------------------- ---------
absl-py                      1.4.0
astunparse                   1.6.3
cachetools                   5.3.0
certifi                      2022.12.7
charset-normalizer           3.1.0
flatbuffers                  23.3.3
gast                         0.4.0
google-auth                  2.17.3
google-auth-oauthlib         1.0.0
google-pasta                 0.2.0
grpcio                       1.54.0
h5py                         3.8.0
idna                         3.4
importlib-metadata           6.6.0
jax                          0.4.8
keras                        2.12.0
libclang                     16.0.0
Markdown                     3.4.3
MarkupSafe                   2.1.2
ml-dtypes                    0.1.0
numpy                        1.23.5
oauthlib                     3.2.2
opt-einsum                   3.3.0
packaging                    23.1
pip                          22.0.4
protobuf                     4.22.3
pyarrow                      10.0.1
pyasn1                       0.5.0
pyasn1-modules               0.3.0
requests                     2.29.0
requests-oauthlib            1.3.1
rsa                          4.9
scipy                        1.10.1
setuptools                   56.0.0
six                          1.16.0
tensorboard                  2.12.2
tensorboard-data-server      0.7.0
tensorboard-plugin-wit       1.8.1
tensorflow                   2.12.0
tensorflow-estimator         2.12.0
tensorflow-io                0.32.0
tensorflow-io-gcs-filesystem 0.32.0
termcolor                    2.3.0
typing_extensions            4.5.0
urllib3                      1.26.15
Werkzeug                     2.3.1
wheel                        0.40.0
wrapt                        1.14.1
zipp                         3.15.0