NVIDIA-AI-IOT/synthetic_data_generation_training_workflow

File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 1034, in GetNext return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self) tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0 Execution status: FAIL

monajalal opened this issue · 1 comments

I get this error during training. All steps before this PASS with no error. Could you please guide me?

!docker run -it --rm --gpus all -v $LOCAL_PROJECT_DIR:/workspace/tao-experiments $DOCKER_CONTAINER \
    detectnet_v2 train -e /workspace/tao-experiments/local/training/tao/specs/training/resnet18_distractors.txt \
    -r /workspace/tao-experiments/local/training/tao/detectnet_v2/resnet18_palletjack -k $KEY --gpus $NUM_GPUS


==============================
=== TAO Toolkit TensorFlow ===
==============================

NVIDIA Release 4.0.0-TensorFlow (build )
TAO Toolkit Version 4.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Using TensorFlow backend.
2024-02-12 19:31:44.622414: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2024-02-12 19:31:51,462 [INFO] root: Starting DetectNet_v2 Training job
2024-02-12 19:31:51,462 [INFO] __main__: Loading experiment spec at /workspace/tao-experiments/local/training/tao/specs/training/resnet18_distractors.txt.
2024-02-12 19:31:51,464 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/tao-experiments/local/training/tao/specs/training/resnet18_distractors.txt
2024-02-12 19:31:51,467 [INFO] root: Training gridbox model.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2024-02-12 19:31:51,467 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/train.py>", line 3, in <module>
  File "<frozen iva.detectnet_v2.scripts.train>", line 1022, in <module>
  File "<frozen iva.detectnet_v2.scripts.train>", line 1011, in <module>
  File "<decorator-gen-117>", line 2, in main
  File "<frozen iva.detectnet_v2.utilities.timer>", line 46, in wrapped_fn
  File "<frozen iva.detectnet_v2.scripts.train>", line 994, in main
  File "<frozen iva.detectnet_v2.scripts.train>", line 853, in run_experiment
  File "<frozen iva.detectnet_v2.scripts.train>", line 625, in train_gridbox
  File "<frozen iva.detectnet_v2.dataloader.build_dataloader>", line 273, in build_dataloader
  File "<frozen iva.detectnet_v2.dataloader.drivenet_dataloader>", line 491, in __init__
  File "<frozen iva.detectnet_v2.dataloader.drivenet_dataloader>", line 548, in _construct_data_sources
  File "<frozen iva.detectnet_v2.dataloader.drivenet_dataloader>", line 395, in __init__
  File "<frozen iva.detectnet_v2.dataloader.drivenet_dataloader>", line 395, in <listcomp>
  File "<frozen iva.detectnet_v2.dataloader.drivenet_dataloader>", line 394, in <genexpr>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 1034, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
Execution status: FAIL

I also encountered the same problem and couldn't solve it. How did you solve it?
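The traceback ends in tf_record_iterator, and "corrupted record at 0" most likely means that the first record of one of the files matched by the dataset paths in the spec failed its checksum. That usually points at a file that is empty, truncated, or not a TFRecord at all. Below is a minimal sketch for finding which file is at fault. It does not use TAO or TensorFlow; it re-implements the TFRecord framing check (8-byte length, masked CRC32C of the length, payload, masked CRC32C of the payload) in plain Python. The function names (`crc32c`, `masked_crc`, `check_tfrecord`) are hypothetical helpers introduced for this example, and the paths you pass in are your own; nothing here is taken from the TAO code base.

```python
import struct
import sys

_MASK_DELTA = 0xA282EAD8  # constant used by the TFRecord CRC masking scheme


def _make_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Plain CRC-32C of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """CRC-32C rotated right by 15 bits plus a constant, as stored in TFRecord files."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def check_tfrecord(path):
    """Return None if every record passes its CRC checks, else (byte_offset, reason)."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            header = f.read(12)  # uint64 length + uint32 masked CRC of the length
            if not header:
                return None  # clean end of file
            if len(header) < 12:
                return offset, "truncated header"
            length, length_crc = struct.unpack("<QI", header)
            if masked_crc(header[:8]) != length_crc:
                return offset, "length CRC mismatch"
            payload = f.read(length + 4)  # data + uint32 masked CRC of the data
            if len(payload) < length + 4:
                return offset, "truncated payload"
            (data_crc,) = struct.unpack("<I", payload[length:])
            if masked_crc(payload[:length]) != data_crc:
                return offset, "data CRC mismatch"
            offset += 12 + length + 4


if __name__ == "__main__":
    # Usage: python check_tfrecords.py <file> [<file> ...]
    for path in sys.argv[1:]:
        result = check_tfrecord(path)
        print(path, "OK" if result is None else "BAD at byte %d: %s" % result)
```

Running this over every file that the tfrecords path in the training spec would match should show whether one of them fails at byte 0. If one does, regenerating the TFRecords and making sure no stray files sit in that directory is a reasonable next step, though this is a guess from the traceback rather than a confirmed fix.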