rebatch API produces a "Check failed: limit <= dim0_size" error
liurcme commented
Current behavior
After applying rebatch(), the data iterator's get_next() produces an error:
F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)
Expected behavior
No error.
System information
- OS Platform and Distribution: Ubuntu 18.04.5 LTS
- TensorFlow version: 1.15.0
- Python version: 3.6
- CUDA/cuDNN version: 10.1
- RAM: 94 GB
- GPU model and memory: Tesla T4, 16 GB
Code to reproduce
Step 1: Generate a parquet file by running the following code
import numpy as np
import pandas as pd
import random

data_list = []
for i in range(1, 10000):
    int_feature = random.randint(1, 100)
    # float_feature = random.random()
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature"])
df.to_parquet("parquet_sample_file.parquet")
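If useful, the generated file can be read back with pandas as a sanity check to confirm its schema (one plain int64 column and one list-typed column, 9999 rows):

# Optional sanity check on the generated file.
import pandas as pd

df_check = pd.read_parquet("parquet_sample_file.parquet")
print(df_check.dtypes)   # int_feature: int64, array_feature: object (lists of ints)
print(len(df_check))     # 9999 rows, since the loop runs over range(1, 10000)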
Step 2: Load the parquet file generated in step 1 with HybridBackend
import tensorflow as tf
import hybridbackend.tensorflow as hb

filenames_ds = tf.data.Dataset.from_tensor_slices(
    ['file1.snappy.parquet', 'file2.snappy.parquet', ..., 'fileN.snappy.parquet'])
hb_fields = []
hb_fields.append(hb.data.DataFrame.Field("feature1", tf.int64, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field("feature2", tf.float32, ragged_rank=1))
hb_fields.append(hb.data.DataFrame.Field("feature3", tf.int64, ragged_rank=1))
ds = filenames_ds.apply(
    hb.data.read_parquet(8192, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds = ds.apply(hb.data.rebatch(8192, fields=hb_fields))
it = ds.make_one_shot_iterator()
item = it.get_next()

batch_size_dict = {}
with tf.Session() as sess:
    print("====== start ======")
    total_batch_size = 0
    while True:
        try:
            batch = sess.run(item)
            batch_size = len(batch['mod_series'])
            batch_size_dict[batch_size] = batch_size_dict.get(batch_size, 0) + 1
        except tf.errors.OutOfRangeError:
            break
Running the above code in a python3 shell throws the following error:
F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)
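To narrow down whether the failure comes from rebatch() or from read_parquet() itself, the same pipeline can be drained without the rebatch step (a sketch reusing filenames_ds and hb_fields from above; whether this loop finishes cleanly is the point of the comparison):

# Sketch: same read_parquet() pipeline, but without rebatch(), to isolate the failing stage.
ds_no_rebatch = filenames_ds.apply(
    hb.data.read_parquet(8192, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
it_check = ds_no_rebatch.make_one_shot_iterator()
item_check = it_check.get_next()
with tf.Session() as sess:
    while True:
        try:
            sess.run(item_check)
        except tf.errors.OutOfRangeError:
            break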
Willing to contribute
Yes
2sin18 commented
Thanks for reporting. Can you provide a sample file to reproduce this issue?
liurcme commented
Thanks for reporting. Can you provide a sample file to reproduce this issue?
(1) Generate a parquet file by running the following code
import numpy as np
import pandas as pd
import random

data_list = []
for i in range(1, 10000):
    int_feature = random.randint(1, 100)
    # float_feature = random.random()
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature"])
df.to_parquet("parquet_sample_file.parquet")
(2) Loading the generated parquet file with HybridBackend reproduces the issue
import tensorflow as tf
import hybridbackend.tensorflow as hb

filenames_ds = tf.data.Dataset.from_tensor_slices(["parquet_sample_file.parquet"])
hb_fields = []
hb_fields.append(hb.data.DataFrame.Field("int_feature", tf.int64, ragged_rank=0))
# hb_fields.append(hb.data.DataFrame.Field("float_feature", tf.float32, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field("array_feature", tf.int64, ragged_rank=1))
ds = filenames_ds.apply(
    hb.data.read_parquet(100, hb_fields, num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds = ds.apply(hb.data.rebatch(100, fields=hb_fields)).repeat(30)
iterator = ds.make_one_shot_iterator()
item = iterator.get_next()

with tf.Session() as sess:
    print("====== start ======")
    total_batch_size = 0
    while True:
        try:
            a = sess.run(item)
        except tf.errors.OutOfRangeError:
            break
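If a per-batch breakdown helps, the loop can also tally observed batch sizes before the crash (assuming each element is a dict keyed by field name, as in the snippet from the issue description):

# Sketch: record observed batch sizes; int_feature has ragged_rank=0, so its
# length should equal the batch size of each element.
batch_size_dict = {}
with tf.Session() as sess:
    while True:
        try:
            batch = sess.run(item)
            size = len(batch['int_feature'])
            batch_size_dict[size] = batch_size_dict.get(size, 0) + 1
        except tf.errors.OutOfRangeError:
            break
print(batch_size_dict)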