DeepRec-AI/HybridBackend

rebatch API produces a "Check failed: limit <= dim0_size" error

Closed this issue · 2 comments

Current behavior

After rebatch(), the data iterator's get_next() produces an error:

F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)

Expected behavior

No error; iteration should complete normally.

System information

  • OS Platform and Distribution: Ubuntu 18.04.5 LTS
  • TensorFlow version: 1.15.0
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1
  • RAM: 94 GB
  • GPU model and memory: Tesla T4, 16 GB

Code to reproduce

Step 1: Generate a parquet file by running the following code

import numpy as np
import pandas as pd
import random

# Build 9999 rows: one scalar int column and one fixed-length list column.
data_list = []
for i in range(1, 10000):
    int_feature = random.randint(1, 100)
    # float_feature = random.random()
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature"])
df.to_parquet("parquet_sample_file.parquet")
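Not required for the repro, but the file written above can be inspected with pyarrow (the engine pandas typically uses for to_parquet) to confirm the expected layout; a minimal sketch, assuming pyarrow is installed:

import pyarrow.parquet as pq

pf = pq.ParquetFile("parquet_sample_file.parquet")
print(pf.schema)             # expect int_feature: int64, array_feature: list<int64>
print(pf.metadata.num_rows)  # expect 9999 rows (range(1, 10000))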

Step 2: Load the parquet file generated in step 1 with HybridBackend

import tensorflow as tf
import hybridbackend.tensorflow as hb


filenames_ds = tf.data.Dataset.from_tensor_slices(['file1.snappy.parquet', 'file2.snappy.parquet', ... 'fileN.snappy.parquet'])


hb_fields = []
hb_fields.append(hb.data.DataFrame.Field("feature1", tf.int64, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field("feature2", tf.float32, ragged_rank=1))
hb_fields.append(hb.data.DataFrame.Field("feature3", tf.int64, ragged_rank=1))

ds = filenames_ds.apply(
    hb.data.read_parquet(8192, hb_fields,
                         num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds = ds.apply(hb.data.rebatch(8192, fields=hb_fields))

it = ds.make_one_shot_iterator()
item = it.get_next()

batch_size_dict = {}
with tf.Session() as sess:
    print("======  start ======")
    total_batch_size = 0
    while True:
        try:
            batch = sess.run(item)
            batch_size = len(batch["feature1"])  # batch is a dict keyed by declared field name
            batch_size_dict[batch_size] = batch_size_dict.get(batch_size, 0) + 1
        except tf.errors.OutOfRangeError:
            break

Running the above code in a python3 shell throws the following error:

F tensorflow/core/framework/tensor.cc:833] Check failed: limit <= dim0_size (8194 vs. 8193)
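One way to narrow this down (a sketch reusing only the reader calls from the snippet above, not a confirmed diagnosis): iterate the read_parquet() output directly, without rebatch(), and check whether the run completes, which would point the failed check at the rebatch stage.

# Hypothetical isolation step: drive the reader without rebatch().
reader_only = filenames_ds.apply(
    hb.data.read_parquet(8192, hb_fields,
                         num_parallel_reads=tf.data.experimental.AUTOTUNE))
reader_item = reader_only.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    while True:
        try:
            sess.run(reader_item)
        except tf.errors.OutOfRangeError:
            break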

Willing to contribute

Yes

Thanks for reporting, can you provide a sample file for reproducing this issue?

(1) Generate a parquet file by running the following code

import numpy as np
import pandas as pd
import random

data_list = []
for i in range(1, 10000):
    int_feature = random.randint(1, 100)
    # float_feature = random.random()
    array_feature = [random.randint(1, 10) for x in range(0, 4)]
    data_list.append([int_feature, array_feature])

df = pd.DataFrame(data_list, columns=["int_feature", "array_feature"])
df.to_parquet("parquet_sample_file.parquet")

(2) Loading the generated parquet file with HybridBackend reproduces the issue

import tensorflow as tf
import hybridbackend.tensorflow as hb

filenames_ds = tf.data.Dataset.from_tensor_slices(["parquet_sample_file.parquet"])

hb_fields = []
hb_fields.append(hb.data.DataFrame.Field("int_feature", tf.int64, ragged_rank=0))
# hb_fields.append(hb.data.DataFrame.Field("float_feature", tf.float32, ragged_rank=0))
hb_fields.append(hb.data.DataFrame.Field("array_feature", tf.int64, ragged_rank=1))

ds = filenames_ds.apply(
    hb.data.read_parquet(100, hb_fields,
                         num_parallel_reads=tf.data.experimental.AUTOTUNE))
ds = ds.apply(hb.data.rebatch(100, fields=hb_fields)).repeat(30)

iterator = ds.make_one_shot_iterator()
item = iterator.get_next()
with tf.Session() as sess:
    print("======  start ======")
    while True:
        try:
            a = sess.run(item)
        except tf.errors.OutOfRangeError:
            break
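As context for the numbers involved (not a confirmed root cause): the sample file holds 9999 rows, which is not a multiple of the 100-row batch size, so the reader's last micro-batch per epoch is short; a quick check of that arithmetic, assuming the file from step (1):

import pandas as pd

rows = len(pd.read_parquet("parquet_sample_file.parquet"))
print(rows)               # 9999
print(divmod(rows, 100))  # (99, 99): 99 full batches plus a 99-row remainder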