Training not starting but process still active
anirudh-chakravarthy opened this issue · 3 comments
Hi,
I am trying to train on the YT-VIS dataset, for which I wrote my own data loader. However, training never seems to begin: the process has been stuck at the same stage for the last 2 days. Could you please take a look?
I'm initializing from the pre-trained network linked in the README of this repo. I've removed the list of uninitialized variables from the log below for brevity. Looking forward to hearing from you.
{
# Engine settings
"model": "conv3d_sep2_ytvis",
"task": "train_no_val",
"dataset": "YTVIS",
"log_verbosity": 5,
"gpus": 1,
#"own_dataset_per_gpu": true,
"use_summaries": false,
"write_summaries": false,
"collect_run_metadata": false,
# MaskRCNN on/off
"add_masks": true,
# Pretrained model from tensorpack
"load_init": "/n/pfister_lab2/Lab/anirudhchak/TrackR-CNN/models/converted",
# Freeze applies to the whole model, not just our backend, but that's fine since only the backend uses batchnorm
"freeze_batchnorm": true,
"max_saves_to_keep": 1,
# Training settings
"batch_size": 8,
"learning_rates": "{1: 0.0000005}",
"optimizer": "adam",
"num_epochs": 12,
"max_saves_to_keep": 1,
# Dataset options
"KITTI_segtrack_data_dir": "/n/pfister_lab2/Lab/vcg_natural/kitti-mots/train/",
"YTVIS_data_dir": "/n/pfister_lab2/Lab/anirudhchak/TrackR-CNN/yt-vis/train/",
"MOTS_segtrack_data_dir": "/globalwork/krause/data/MOTS_challenge/train/",
"optical_flow_path": "/globalwork/krause/data/KITTI_flow_pwc/",
"prefer_gt_to_ignore": true,
"use_ioa_for_ignore": true,
"use_masks_for_ignore": false,
"resize_mode_train": "fixed_size",
"input_size_train": [640, 360],
"resize_mode_val": "fixed_size",
"input_size_val": [640, 360],
"augmentors_train": ["flip", "gamma"],
"num_parallel_calls": 6,
"prefetch_buffer_size": 8,
"mask_disjoint_strategy": "score",
"tracker": "hungarian", "tracker_reid_comp": "euclidean", "detection_confidence_threshold": 0.8469800990815324, "reid_weight": 1.0, "mask_iou_weight": 0.0, "bbox_center_weight": 0.0, "bbox_iou_weight": 0.0, "association_threshold": 0.8165986526897969, "keep_alive": 4, "reid_euclidean_offset": 8.810218833503743, "reid_euclidean_scale": 1.0090931467228708,
"network": {
"resnetconv4": {"class": "ResNet101Conv4"},
"conv3d_1": {"class": "SepConv3DOverBatch", "activation": "relu", "n_features": 1024, "init_type": "identity", "from": ["resnetconv4"], "old_order": true},
"conv3d_2": {"class": "SepConv3DOverBatch", "activation": "relu", "n_features": 1024, "init_type": "identity", "from": ["conv3d_1"], "old_order": true},
"frcnn": {"class": "FasterRCNN", "fastrcnn_batch_per_img": 64, "reid_dimension": 128, "reid_loss_per_class": false,
"reid_loss_factor": 1.0, "reid_loss_variant": 1, "reid_measure": "euclidean", "from": ["conv3d_2"],
"class_agnostic_box_and_mask_heads": true}
}
}
loading annotations into memory...
Done (t=16.46s)
creating index...
index created!
creating trainnet...
inputs:
images [8, 640, 360, 3]
classes [8, 100]
ids [8, 100]
is_crowd [8, 100]
segmentation_mask [8, 640, 360, 100]
raw_image_sizes [8, 2]
inputs [8, 640, 360, 3]
bboxes_x0y0x1y1 [8, 100, 4]
featuremap_boxes [8, 40, 40, 15, 4]
featuremap_labels [8, 40, 40, 15]
skip_example []
network:
resnetconv4: [8, 40, 22, 1024], 0 params
conv3d_1: [8, 40, 22, 1024], 1076224 params
conv3d_2: [8, 40, 22, 1024], 1076224 params
frcnn: , 0 params
number of parameters: 2,152,448
the following variables will not be initialized since they are not present in the initialization model [A LOT OF VARIABLES, REMOVED FOR BREVITY!]
the following variables will not be loaded from the file since they are not present in the graph
the following variables will not be loaded from the file since the shapes in the graph and in the file don't match: [('frcnn/fastrcnn/class/W:0', TensorShape([Dimension(2048), Dimension(41)])), ('frcnn/fastrcnn/class/b:0', TensorShape([Dimension(41)]))]
initializing model from /n/pfister_lab2/Lab/anirudhchak/TrackR-CNN/models/converted
starting training
After 2 days, I get the error below. I assume something is wrong with the data loader? Could you point out anything I may be overlooking?
Traceback (most recent call last):
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[{{node IteratorGetNext}}]]
[[{{node trainnet/tower_gpu_0/frcnn_3/StopGradient_4}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 39, in <module>
tf.app.run(main)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "main.py", line 35, in main
engine.run()
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 81, in run
self.train()
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 112, in train
train_measures = self.run_epoch(self.trainer.train_step, self.train_data, epoch, is_train_run=True)
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 144, in run_epoch
res = step_fn(epoch, n_examples_processed_total=n_examples_processed_total, extraction_keys=extraction_keys)
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Trainer.py", line 168, in train_step
res = self._step(self.train_network, feed_dict, ops, self.summary_op_train, extraction_keys, step_number=None)
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Trainer.py", line 188, in _step
res = self.session.run(ops, feed_dict=feed_dict, options=run_options, run_metadata=run_metadata)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[node IteratorGetNext (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/datasets/Dataset.py:303) ]]
[[node trainnet/tower_gpu_0/frcnn_3/StopGradient_4 (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/network/FasterRCNN_utils.py:328) ]]
Caused by op 'IteratorGetNext', defined at:
File "main.py", line 39, in <module>
tf.app.run(main)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "main.py", line 34, in main
engine = Engine(config)
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 41, in __init__
name="trainnet")
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/network/Network.py", line 16, in __init__
self.input_tensors_dict = dataset.create_input_tensors_dict(self.batch_size)
File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/datasets/Dataset.py", line 303, in create_input_tensors_dict
res = tfdata.make_one_shot_iterator().get_next()
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 414, in get_next
output_shapes=self._structure._flat_shapes, name=name)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1685, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
OutOfRangeError (see above for traceback): End of sequence
[[node IteratorGetNext (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/datasets/Dataset.py:303) ]]
[[node trainnet/tower_gpu_0/frcnn_3/StopGradient_4 (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/network/FasterRCNN_utils.py:328) ]]
@anirudh-chakravarthy Hi! How did you solve this bug in the end?
Hi,
This issue was caused by a problem in the data loader. You need to ensure that images are being loaded correctly and in the right sequence!
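For anyone hitting the same thing: the `OutOfRangeError: End of sequence` in the traceback is what a TensorFlow iterator raises once its dataset has no more elements, so it is worth checking what the loader actually produces before starting a long run. Below is a minimal sketch of such a check in plain Python. It is not code from this repo. The `videos` mapping, `frame_index` and `check_video_frames` are hypothetical names, and the sketch assumes that frame files carry their frame number in the file name and that each batch is built from `batch_size` consecutive frames of one video.

```python
import re
from typing import Dict, List


def frame_index(path: str) -> int:
    """Return the last integer in the file name, e.g. 'vid/00010.jpg' -> 10."""
    name = path.rsplit("/", 1)[-1]
    numbers = re.findall(r"\d+", name)
    if not numbers:
        raise ValueError(f"no frame number in {path!r}")
    return int(numbers[-1])


def check_video_frames(videos: Dict[str, List[str]], batch_size: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed.

    `videos` maps a video id to its frame paths in the order the loader would read them.
    Assumption: one batch = `batch_size` consecutive frames from a single video.
    """
    problems = []
    full_batches = 0
    for video_id, frames in videos.items():
        indices = [frame_index(p) for p in frames]
        if indices != sorted(indices):
            problems.append(f"{video_id}: frames are not in temporal order")
        if len(set(indices)) != len(indices):
            problems.append(f"{video_id}: duplicate frame numbers")
        if len(frames) < batch_size:
            problems.append(
                f"{video_id}: only {len(frames)} frames, fewer than batch_size={batch_size}"
            )
        else:
            full_batches += len(frames) // batch_size
    if full_batches == 0:
        problems.append("no video yields a full batch; the input iterator would be empty")
    return problems


if __name__ == "__main__":
    # Toy input; replace with the file lists your own loader builds.
    example = {
        "video_a": ["video_a/00000.jpg", "video_a/00005.jpg", "video_a/00010.jpg"],
        "video_b": ["video_b/00005.jpg", "video_b/00000.jpg"],
    }
    for problem in check_video_frames(example, batch_size=2):
        print(problem)
```

Running a check like this on the real file lists takes seconds. If it reports nothing, the next thing to look at would be whether every image and annotation file can actually be opened and decoded.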