Using ArrowDataset with tf.data.Dataset.interleave causes crashes
sebp opened this issue · 0 comments
Description
I want to use ArrowDataset to directly load record batches from a string Tensor via tf.data.Dataset.interleave; however, the program crashes instead.
Actual result
2023-04-28 11:05:47.396681: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-28 11:05:47.960406: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-04-28 11:05:48.626703: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
2023-04-28 11:05:48.734126: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [3]
[[{{node Placeholder/_0}}]]
2023-04-28 11:05:48.734300: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype string and shape [3]
[[{{node Placeholder/_0}}]]
2023-04-28 11:05:48.745237: E tensorflow/core/framework/dataset.cc:617] UNIMPLEMENTED: Cannot compute input sources for dataset of type IO>ArrowSerializedDataset, because the dataset does not implement `InputDatasets`.
2023-04-28 11:05:48.745307: E tensorflow/core/framework/dataset.cc:621] UNIMPLEMENTED: Cannot merge options for dataset of type IO>ArrowSerializedDataset, because the dataset does not implement `InputDatasets`.
external/arrow/cpp/src/arrow/array/data.cc:95: Check failed: (off) <= (length) Slice offset greater than array length
fish: Job 1, ''python arr.py' terminated by signal SIGABRT (Abort)
Expected result
Same output as when passing a list of record batches directly to ArrowDataset.from_record_batches.
Code to reproduce
import numpy as np
import pyarrow as pa
import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset
# create arrow data
data = [
    pa.array([np.random.randn(128).astype(np.float32) for _ in range(3)]),
    pa.array(['foo', 'bar', 'baz']),
    pa.array([True, False, True])
]
# create arrow batch
batch = pa.record_batch(data, names=['f0', 'f1', 'f2'])
# write data to 3 files
data_files = ['dataset0.arrow', 'dataset1.arrow', 'dataset2.arrow']
for data_file in data_files:
    with open(data_file, 'wb') as sink:
        with pa.ipc.new_file(sink, batch.schema) as writer:
            for i in range(10):
                writer.write_batch(batch)
# tensorflow types and shape of data stored in record batch
output_types = (tf.float32, tf.string, tf.bool,)
output_shapes = (tf.TensorShape([128]), tf.TensorShape([]), tf.TensorShape([]),)
# create dataset
ds = tf.data.Dataset.from_tensor_slices(
    data_files
).map(
    tf.io.read_file
).interleave(
    lambda x: ArrowDataset(x, columns=(0, 1, 2,), output_types=output_types, output_shapes=output_shapes),
    cycle_length=1,
)
Versions
- Ubuntu 22.04.1 LTS (Jammy Jellyfish)
- Python 3.8.13
- packages:
Package                      Version
---------------------------- ---------
absl-py                      1.4.0
astunparse                   1.6.3
cachetools                   5.3.0
certifi                      2022.12.7
charset-normalizer           3.1.0
flatbuffers                  23.3.3
gast                         0.4.0
google-auth                  2.17.3
google-auth-oauthlib         1.0.0
google-pasta                 0.2.0
grpcio                       1.54.0
h5py                         3.8.0
idna                         3.4
importlib-metadata           6.6.0
jax                          0.4.8
keras                        2.12.0
libclang                     16.0.0
Markdown                     3.4.3
MarkupSafe                   2.1.2
ml-dtypes                    0.1.0
numpy                        1.23.5
oauthlib                     3.2.2
opt-einsum                   3.3.0
packaging                    23.1
pip                          22.0.4
protobuf                     4.22.3
pyarrow                      10.0.1
pyasn1                       0.5.0
pyasn1-modules               0.3.0
requests                     2.29.0
requests-oauthlib            1.3.1
rsa                          4.9
scipy                        1.10.1
setuptools                   56.0.0
six                          1.16.0
tensorboard                  2.12.2
tensorboard-data-server      0.7.0
tensorboard-plugin-wit       1.8.1
tensorflow                   2.12.0
tensorflow-estimator         2.12.0
tensorflow-io                0.32.0
tensorflow-io-gcs-filesystem 0.32.0
termcolor                    2.3.0
typing_extensions            4.5.0
urllib3                      1.26.15
Werkzeug                     2.3.1
wheel                        0.40.0
wrapt                        1.14.1
zipp                         3.15.0