VisualComputingInstitute/TrackR-CNN

Training not starting but process still active

anirudh-chakravarthy opened this issue · 3 comments

Hi,

I was training on the YT-VIS dataset, for which I wrote my own dataloader. However, training doesn't seem to start for some reason; it has been stuck at the same stage for the last 2 days. Could you please take a look?

I'm trying to initialize from the pre-trained network given in the README of this repo. I've removed the message listing the non-initialized variables for brevity. Looking forward to hearing from you!

{
  # Engine settings
  "model": "conv3d_sep2_ytvis",
  "task": "train_no_val",
  "dataset": "YTVIS",
  "log_verbosity": 5,
  "gpus": 1,
  #"own_dataset_per_gpu": true,
  "use_summaries": false,
  "write_summaries": false,
  "collect_run_metadata": false,

  # MaskRCNN on/off
  "add_masks": true,

  # Pretrained model from tensorpack
  "load_init": "/n/pfister_lab2/Lab/anirudhchak/TrackR-CNN/models/converted",
  # Freeze applies to the whole model, not just our backend, but that's fine since only the backend uses batchnorm
  "freeze_batchnorm": true,
  "max_saves_to_keep": 1,

  # Training settings
  "batch_size": 8,
  "learning_rates": "{1: 0.0000005}",
  "optimizer": "adam",
  "num_epochs": 12,
  "max_saves_to_keep": 1,

  # Dataset options
  "KITTI_segtrack_data_dir": "/n/pfister_lab2/Lab/vcg_natural/kitti-mots/train/",
  "YTVIS_data_dir": "/n/pfister_lab2/Lab/anirudhchak/TrackR-CNN/yt-vis/train/",
  "MOTS_segtrack_data_dir": "/globalwork/krause/data/MOTS_challenge/train/",
  "optical_flow_path": "/globalwork/krause/data/KITTI_flow_pwc/",
  "prefer_gt_to_ignore": true,
  "use_ioa_for_ignore": true,
  "use_masks_for_ignore": false,
  "resize_mode_train": "fixed_size",
  "input_size_train": [640, 360],
  "resize_mode_val": "fixed_size",
  "input_size_val": [640, 360],

  "augmentors_train": ["flip", "gamma"],
  "num_parallel_calls": 6,
  "prefetch_buffer_size": 8,

  "mask_disjoint_strategy": "score",
  "tracker": "hungarian", "tracker_reid_comp": "euclidean", "detection_confidence_threshold": 0.8469800990815324, "reid_weight": 1.0, "mask_iou_weight": 0.0, "bbox_center_weight": 0.0, "bbox_iou_weight": 0.0, "association_threshold": 0.8165986526897969, "keep_alive": 4, "reid_euclidean_offset": 8.810218833503743, "reid_euclidean_scale": 1.0090931467228708,

  "network": {
    "resnetconv4": {"class": "ResNet101Conv4"},
    "conv3d_1": {"class": "SepConv3DOverBatch", "activation": "relu", "n_features": 1024, "init_type": "identity", "from": ["resnetconv4"], "old_order": true},
    "conv3d_2": {"class": "SepConv3DOverBatch", "activation": "relu", "n_features": 1024, "init_type": "identity", "from": ["conv3d_1"], "old_order": true},
    "frcnn": {"class": "FasterRCNN", "fastrcnn_batch_per_img": 64, "reid_dimension": 128, "reid_loss_per_class": false,
              "reid_loss_factor": 1.0, "reid_loss_variant": 1, "reid_measure": "euclidean", "from": ["conv3d_2"],
              "class_agnostic_box_and_mask_heads": true}
  }
}

loading annotations into memory...
Done (t=16.46s)
creating index...
index created!
creating trainnet...
inputs:
images [8, 640, 360, 3]
classes [8, 100]
ids [8, 100]
is_crowd [8, 100]
segmentation_mask [8, 640, 360, 100]
raw_image_sizes [8, 2]
inputs [8, 640, 360, 3]
bboxes_x0y0x1y1 [8, 100, 4]
featuremap_boxes [8, 40, 40, 15, 4]
featuremap_labels [8, 40, 40, 15]
skip_example []
network:
resnetconv4: [8, 40, 22, 1024], 0 params
conv3d_1: [8, 40, 22, 1024], 1076224 params
conv3d_2: [8, 40, 22, 1024], 1076224 params
frcnn: , 0 params
number of parameters: 2,152,448
the following variables will not be initialized since they are not present in the initialization model [A LOT OF VARIABLES, REMOVED FOR BREVITY!]
the following variables will not be loaded from the file since they are not present in the graph 
the following variables will not be loaded from the file since the shapes in the graph and in the file don't match: [('frcnn/fastrcnn/class/W:0', TensorShape([Dimension(2048), Dimension(41)])), ('frcnn/fastrcnn/class/b:0', TensorShape([Dimension(41)]))]
initializing model from /n/pfister_lab2/Lab/anirudhchak/TrackR-CNN/models/converted
starting training

After 2 days, I get the error below. I assume something is wrong with the dataloader? Would you be able to point out anything I might be overlooking?

Traceback (most recent call last):
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
	 [[{{node IteratorGetNext}}]]
	 [[{{node trainnet/tower_gpu_0/frcnn_3/StopGradient_4}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 39, in <module>
    tf.app.run(main)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 35, in main
    engine.run()
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 81, in run
    self.train()
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 112, in train
    train_measures = self.run_epoch(self.trainer.train_step, self.train_data, epoch, is_train_run=True)
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 144, in run_epoch
    res = step_fn(epoch, n_examples_processed_total=n_examples_processed_total, extraction_keys=extraction_keys)
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Trainer.py", line 168, in train_step
    res = self._step(self.train_network, feed_dict, ops, self.summary_op_train, extraction_keys, step_number=None)
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Trainer.py", line 188, in _step
    res = self.session.run(ops, feed_dict=feed_dict, options=run_options, run_metadata=run_metadata)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
	 [[node IteratorGetNext (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/datasets/Dataset.py:303) ]]
	 [[node trainnet/tower_gpu_0/frcnn_3/StopGradient_4 (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/network/FasterRCNN_utils.py:328) ]]

Caused by op 'IteratorGetNext', defined at:
  File "main.py", line 39, in <module>
    tf.app.run(main)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 34, in main
    engine = Engine(config)
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/core/Engine.py", line 41, in __init__
    name="trainnet")
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/network/Network.py", line 16, in __init__
    self.input_tensors_dict = dataset.create_input_tensors_dict(self.batch_size)
  File "/net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/datasets/Dataset.py", line 303, in create_input_tensors_dict
    res = tfdata.make_one_shot_iterator().get_next()
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 414, in get_next
    output_shapes=self._structure._flat_shapes, name=name)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1685, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/n/home09/achakravarthy/.conda/envs/trcnn-kitti/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): End of sequence
	 [[node IteratorGetNext (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/datasets/Dataset.py:303) ]]
	 [[node trainnet/tower_gpu_0/frcnn_3/StopGradient_4 (defined at /net/coxfs01/srv/export/coxfs01/pfister_lab2/share_root/Lab/anirudhchak/TrackR-CNN/network/FasterRCNN_utils.py:328) ]]

@anirudh-chakravarthy Hi! How did you end up solving this bug?

Hi,

This issue is caused by a problem in the data loader. You need to make sure the images are being loaded correctly and in the right sequence!
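Concretely, the `OutOfRangeError: End of sequence` above means the `tf.data` iterator created in `Dataset.create_input_tensors_dict` ran out of elements, i.e. the custom dataloader produced fewer batches than the training loop asked for (possibly zero, e.g. if the file lists or paths don't resolve). A quick way to check this outside of the engine is to iterate the dataset your loader builds and count how many batches it actually yields. A minimal TF 1.x sketch (the helper name and structure are illustrative, not part of TrackR-CNN):

```python
import tensorflow as tf

def count_batches(dataset, max_batches=1000):
    """Iterate a tf.data.Dataset once and report how many batches it yields.

    If this prints 0 (or far fewer batches than expected), the dataloader is
    the culprit: the iterator gets exhausted and get_next() raises
    OutOfRangeError, exactly as in the traceback above.
    """
    iterator = dataset.make_one_shot_iterator()
    next_batch = iterator.get_next()
    n = 0
    with tf.Session() as sess:
        try:
            while n < max_batches:
                sess.run(next_batch)
                n += 1
        except tf.errors.OutOfRangeError:
            pass  # dataset exhausted
    print("dataset yielded", n, "batches")
    return n
```

Also check that the training dataset provides enough examples per epoch (or is built with `dataset.repeat()` if your loader relies on repetition); otherwise the first epoch can request more batches than the pipeline provides and fail with exactly this error.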