waymo-research/waymax

How to correctly read a TFRecord dataset with only one file?

Closed this issue · 2 comments

I am trying to read a TFRecord dataset with TensorFlow, which contains only one file. The dataset path is path_to_my_tfrecord_file.tfrecord-00008-of-00150. However, I am encountering issues when trying to read this dataset.

Originally, I specified the dataset path as follows:

path='gs://some_path_to_dataset/training_tfexample.tfrecord-00000-of-01000',

But when I attempted to read the data, I received a series of error messages indicating that it could not parse specific keys (e.g., roadgraph_samples/xyz):

2023-10-24 00:11:34.643286: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at example_parsing_ops.cc:98 : INVALID_ARGUMENT: Key: roadgraph_samples/xyz.  Can't parse serialized Example.
...

Then, I tried changing the path in this way, using the '@' symbol along with the shard count:

path='gs://some_path_to_dataset/training_tfexample.tfrecord@1000',

With this path, my code was able to read the data correctly. However, since my dataset contains only one file, I am not sure whether this method applies.

What I want to know is, how should I specify the path correctly if my dataset only contains one file? Should I include the '@' symbol and shard count? If so, how should I handle the dataset with a single file to ensure my code can read the data correctly?

Can anyone provide some assistance? Much appreciated!

You can specify num_paths in the dataset config, e.g.

    from waymax.config import DataFormat, DatasetConfig

    conf = DatasetConfig(
        path='gs://some_path_to_dataset/training_tfexample.tfrecord-00000-of-01000',
        max_num_rg_points=20000,
        data_format=DataFormat.TFRECORD,
        num_paths=1,
    )

@Jynxzzz The single-shard syntax works for me on the default Google Cloud paths (e.g. replacing ...tfrecord@N with ...tfrecord-00000-of-N).
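For reference, here is a minimal sketch of how the two path forms relate. It assumes the five-digit, zero-padded shard naming seen in the paths in this thread; `expand_sharded_spec` is a hypothetical helper written for illustration, not a Waymax or TensorFlow function.

```python
import re
from typing import List


def expand_sharded_spec(spec: str) -> List[str]:
    """Expand 'base@N' into N shard file names (hypothetical helper).

    Assumes the 'base-XXXXX-of-NNNNN' naming convention with five-digit
    zero padding. A spec without '@N' is treated as a single file path
    and returned unchanged.
    """
    match = re.fullmatch(r"(.+)@(\d+)", spec)
    if match is None:
        return [spec]
    base, num_shards = match.group(1), int(match.group(2))
    return [f"{base}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


shards = expand_sharded_spec("training_tfexample.tfrecord@1000")
print(shards[0])    # first shard name
print(len(shards))  # number of shards
```

Under that assumption, the `@1000` form names all 1000 shards, while a full shard name such as `...tfrecord-00000-of-01000` refers to exactly one file.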

The issue might be elsewhere: the key error occurs whenever the data format does not match what the parser expects. For example, the number of roadgraph points differs between some of the datasets, so make sure max_num_rg_points is set to match your data (e.g. 20k vs 30k).
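As a rough illustration of why a mismatch in the roadgraph point count produces a parse error, the sketch below assumes that roadgraph_samples/xyz stores three coordinates per point and is parsed as a fixed-length feature. The function names are hypothetical and exist only for this example.

```python
def expected_num_values(max_num_rg_points: int, dims: int = 3) -> int:
    """Number of values a fixed-length [max_num_rg_points, dims] feature expects."""
    return max_num_rg_points * dims


def can_parse(stored_num_values: int, max_num_rg_points: int) -> bool:
    """A fixed-length feature parses only if the stored count matches exactly."""
    return stored_num_values == expected_num_values(max_num_rg_points)


# Assumption: the file was written with 20k roadgraph points per example.
stored = 20000 * 3

print(can_parse(stored, max_num_rg_points=30000))  # False: config expects 90000 values
print(can_parse(stored, max_num_rg_points=20000))  # True: config matches the data
```

If the configured point count does not match the file, every example fails on that key, which is consistent with the "Can't parse serialized Example" message above.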