emedvedev/attention-ocr

During training, the following errors are generated:

raghavw7 opened this issue · 2 comments

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/aocr/main.py:20: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/aocr/main.py:20: The name tf.logging.ERROR is deprecated. Please use tf.compat.v1.logging.ERROR instead.

2019-12-01 01:04:29.729170: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-12-01 01:04:29.729380: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2b8d100 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-12-01 01:04:29.729414: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2019-12-01 01:04:29.731396: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-01 01:04:29.834754: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:29.835445: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2b8d2c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2019-12-01 01:04:29.835479: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
2019-12-01 01:04:29.835766: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:29.836291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2019-12-01 01:04:29.836619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2019-12-01 01:04:29.838119: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2019-12-01 01:04:29.839683: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2019-12-01 01:04:29.840002: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2019-12-01 01:04:29.841474: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2019-12-01 01:04:29.842198: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2019-12-01 01:04:29.845373: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-01 01:04:29.845487: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:29.846084: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:29.846582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-12-01 01:04:29.846639: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2019-12-01 01:04:29.847767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-01 01:04:29.847790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2019-12-01 01:04:29.847800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2019-12-01 01:04:29.847900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:29.848430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:29.848931: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2019-12-01 01:04:29.848972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15216 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
2019-12-01 01:04:29,849 root INFO phase: train
2019-12-01 01:04:29,850 root INFO model_dir: ./checkpoints
2019-12-01 01:04:29,850 root INFO load_model: True
2019-12-01 01:04:29,850 root INFO output_dir: ./results
2019-12-01 01:04:29,850 root INFO steps_per_checkpoint: 100
2019-12-01 01:04:29,850 root INFO batch_size: 65
2019-12-01 01:04:29,850 root INFO learning_rate: 1.000000
2019-12-01 01:04:29,851 root INFO reg_val: 0
2019-12-01 01:04:29,851 root INFO max_gradient_norm: 5.000000
2019-12-01 01:04:29,851 root INFO clip_gradients: True
2019-12-01 01:04:29,851 root INFO max_image_width 160.000000
2019-12-01 01:04:29,851 root INFO max_prediction_length 8.000000
2019-12-01 01:04:29,851 root INFO channels: 1
2019-12-01 01:04:29,851 root INFO target_embedding_size: 10.000000
2019-12-01 01:04:29,851 root INFO attn_num_hidden: 128
2019-12-01 01:04:29,851 root INFO attn_num_layers: 2
2019-12-01 01:04:29,851 root INFO visualize: False
2019-12-01 01:04:36,559 root INFO Created model with fresh parameters.
2019-12-01 01:04:36.707966: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Switch: GPU CPU XLA_CPU XLA_GPU
Enter: GPU CPU XLA_CPU XLA_GPU
LookupTableFindV2: CPU
LookupTableInsertV2: CPU
MutableHashTableV2: CPU
LookupTableExportV2: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
MutableHashTable (MutableHashTableV2) /device:GPU:0
MutableHashTable_lookup_table_export_values/LookupTableExportV2 (LookupTableExportV2) /device:GPU:0
MutableHashTable_lookup_table_insert/LookupTableInsertV2 (LookupTableInsertV2) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter_1 (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Switch (Switch) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter_2 (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter_3 (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Switch_2 (Switch) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2 (LookupTableFindV2) /device:GPU:0

2019-12-01 01:04:37,606 root INFO num_epoch: 1000
2019-12-01 01:04:38,601 root INFO Starting the training process.
2019-12-01 01:04:38.616510: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:38.617067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2019-12-01 01:04:38.617150: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2019-12-01 01:04:38.617179: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2019-12-01 01:04:38.617199: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2019-12-01 01:04:38.617220: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2019-12-01 01:04:38.617239: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2019-12-01 01:04:38.617257: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2019-12-01 01:04:38.617285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-01 01:04:38.617387: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:38.617992: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:38.618477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-12-01 01:04:38.618525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-01 01:04:38.618539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2019-12-01 01:04:38.618548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2019-12-01 01:04:38.618633: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:38.619163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-12-01 01:04:38.619650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15216 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
2019-12-01 01:04:38.759750: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Switch: GPU CPU XLA_CPU XLA_GPU
Enter: GPU CPU XLA_CPU XLA_GPU
LookupTableFindV2: CPU
LookupTableInsertV2: CPU
MutableHashTableV2: CPU
LookupTableExportV2: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
MutableHashTable (MutableHashTableV2) /device:GPU:0
MutableHashTable_lookup_table_export_values/LookupTableExportV2 (LookupTableExportV2) /device:GPU:0
MutableHashTable_lookup_table_insert/LookupTableInsertV2 (LookupTableInsertV2) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter_1 (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Switch (Switch) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter_2 (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Enter_3 (Enter) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2/Switch_2 (Switch) /device:GPU:0
map_1/while/foldr/while/cond/MutableHashTable_lookup_table_find/LookupTableFindV2 (LookupTableFindV2) /device:GPU:0

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: content/dataset.tfrecords; No such file or directory
[[{{node IteratorGetNext}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/aocr", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/aocr/main.py", line 257, in main
num_epoch=parameters.num_epoch
File "/usr/local/lib/python3.6/dist-packages/aocr/model/model.py", line 376, in train
for batch in s_gen.gen(self.batch_size):
File "/usr/local/lib/python3.6/dist-packages/aocr/util/data_gen.py", line 67, in gen
raw_images, raw_labels, raw_comments = sess.run([images, labels, comments])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: content/dataset.tfrecords; No such file or directory
[[node IteratorGetNext (defined at /lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'IteratorGetNext':
File "/bin/aocr", line 8, in
sys.exit(main())
File "/lib/python3.6/dist-packages/aocr/main.py", line 257, in main
num_epoch=parameters.num_epoch
File "/lib/python3.6/dist-packages/aocr/model/model.py", line 376, in train
for batch in s_gen.gen(self.batch_size):
File "/lib/python3.6/dist-packages/aocr/util/data_gen.py", line 62, in gen
images, labels, comments = iterator.get_next()
File "/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 426, in get_next
name=name)
File "/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()

Make sure that your dataset.tfrecords path is correct. If it still doesn't work, use an absolute path.
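For example, the relative path `content/dataset.tfrecords` in the traceback looks like a Google Colab path with the leading slash missing. If that's the case, the training command would be something like this (the `/content` location is an assumption based on the log above; adjust it to wherever your file actually is):

```
# Assumed Colab location of the TFRecords file; substitute your own absolute path.
aocr train /content/dataset.tfrecords
```

A relative path is resolved against whatever directory the command happens to be run from, so an absolute path removes that ambiguity.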

Thank you, @macabdul9! It worked once I entered the correct path.