MarvinTeichmann/MultiNet

When I train MultiNet2, there is a problem

1120651074 opened this issue · 2 comments

Hi @MarvinTeichmann

Thanks again for releasing this code!

When I train MultiNet2 on the KITTI data myself, I run into the following error:

```
W tensorflow/core/kernels/queue_base.cc:294] _1_Queues_detection/fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
  File "train.py", line 616, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 608, in main
    tv_sess, start_step=start_step)
  File "train.py", line 229, in run_united_training
    sess.run([subgraph[model]['train_op']], feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 786, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 994, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1044, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1064, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[512,512,3,3]
     [[Node: conv4_2_1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](conv4_1_1/Relu, conv4_2/filter/read)]]
     [[Node: training/Adam_1/update/_72 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_7851_training/Adam_1/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'conv4_2_1/Conv2D', defined at:
  File "train.py", line 616, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 595, in main
    subhypes, submodules, subgraph, tv_sess = build_united_model(hypes)
  File "train.py", line 535, in build_united_model
    first_iter)
  File "train.py", line 130, in build_training_graph
    logits = encoder.inference(hypes, image, train=True)
  File "/home/nextcar/MultiNet/submodules/KittiBox/hypes/../encoder/vgg.py", line 28, in inference
    random_init_fc8=True)
  File "/home/nextcar/MultiNet/incl/tensorflow_fcn/fcn8_vgg.py", line 88, in build
    self.conv4_2 = self._conv_layer(self.conv4_1, "conv4_2")
  File "/home/nextcar/MultiNet/incl/tensorflow_fcn/fcn8_vgg.py", line 155, in _conv_layer
    conv = tf.nn.conv2d(bottom, filt, [1, 1, 1, 1], padding='SAME')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 416, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[512,512,3,3]
     [[Node: conv4_2_1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](conv4_1_1/Relu, conv4_2/filter/read)]]
     [[Node: training/Adam_1/update/_72 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_7851_training/Adam_1/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/nextcar/MultiNet/submodules/KittiSeg/hypes/../inputs/kitti_seg_input.py", line 351, in enqueue_loop
    sess.run(enqueue_op, feed_dict=make_feed(d))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 786, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 994, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1044, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1064, in _do_call
    raise type(e)(node_def, op, message)
CancelledError: Enqueue operation was cancelled
     [[Node: fifo_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](Queues_segmentation/fifo_queue, _recv_Placeholder_2_0, _recv_Placeholder_3_0)]]

Caused by op u'fifo_queue_enqueue', defined at:
  File "train.py", line 616, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 595, in main
    subhypes, submodules, subgraph, tv_sess = build_united_model(hypes)
  File "train.py", line 564, in build_united_model
    'train', sess)
  File "/home/nextcar/MultiNet/submodules/KittiSeg/hypes/../inputs/kitti_seg_input.py", line 353, in start_enqueuing_threads
    enqueue_op = q.enqueue((image_pl, label_pl))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 322, in enqueue
    self._queue_ref, vals, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1587, in _queue_enqueue_v2
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
     [[Node: fifo_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](Queues_segmentation/fifo_queue, _recv_Placeholder_2_0, _recv_Placeholder_3_0)]]

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/nextcar/MultiNet/submodules/KittiBox/hypes/../inputs/kitti_input.py", line 230, in thread_loop
    sess.run(enqueue_op, feed_dict=make_feed(d))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 786, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 994, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1044, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1064, in _do_call
    raise type(e)(node_def, op, message)
CancelledError: Enqueue operation was cancelled
     [[Node: fifo_queue_enqueue_1 = QueueEnqueueV2[Tcomponents=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](Queues_detection/fifo_queue, _recv_Placeholder_4_0, _recv_Placeholder_5_0, _recv_Placeholder_6_0, _recv_Placeholder_7_0)]]

Caused by op u'fifo_queue_enqueue_1', defined at:
  File "train.py", line 616, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 595, in main
    subhypes, submodules, subgraph, tv_sess = build_united_model(hypes)
  File "train.py", line 564, in build_united_model
    'train', sess)
  File "/home/nextcar/MultiNet/submodules/KittiBox/hypes/../inputs/kitti_input.py", line 220, in start_enqueuing_threads
    enqueue_op = q.enqueue((x_in, confs_in, boxes_in, mask_in))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 322, in enqueue
    self._queue_ref, vals, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1587, in _queue_enqueue_v2
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

CancelledError (see above for traceback): Enqueue operation was cancelled
     [[Node: fifo_queue_enqueue_1 = QueueEnqueueV2[Tcomponents=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](Queues_detection/fifo_queue, _recv_Placeholder_4_0, _recv_Placeholder_5_0, _recv_Placeholder_6_0, _recv_Placeholder_7_0)]]
```
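For what it's worth, the allocation that actually fails is tiny, so the GPU memory must already be almost completely consumed by the rest of the graph before this op runs (the `CancelledError`s in the enqueue threads are just a side effect of the session shutting down after the OOM). A quick sanity check of the reported shape, assuming the standard float32 (4-byte) weights used by the VGG encoder:

```python
# Size of the tensor that failed to allocate: shape [512, 512, 3, 3], float32.
# If even this small a buffer cannot be allocated, the GPU is effectively full.
import functools
import operator

shape = [512, 512, 3, 3]   # conv4_2 filter, as reported in the OOM message
bytes_per_float32 = 4

n_elements = functools.reduce(operator.mul, shape)
size_mib = n_elements * bytes_per_float32 / 1024.0 / 1024.0
print("elements: %d, size: %.1f MiB" % (n_elements, size_mib))
# elements: 2359296, size: 9.0 MiB
```

So the failure is about total memory pressure, not this one 9 MiB tensor; things that usually help in this situation are a GPU with more memory, a smaller batch size or input resolution in the hypes files, or making sure no other process is holding GPU memory.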

I don't know how to solve it. Could you tell me what the reason is?

Thank you!

No, I have never run across this error.

I have the same problem. Can you give me some tips?