tensorflow/models

object_detection Shape mismatch after adjusting the pipeline config, anything missing?


Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

Base config file: https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/configs/tf2/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.config

Base checkpoints: http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.tar.gz

2. Describe the bug

This issue concerns fine-tuning a detection model, in my case one based on Faster R-CNN. I have changed all the necessary parameters in the configuration file to match my dataset. The config file can be found here.

I am launching the training locally with the following command:

python ~/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --alsologtostderr

Here, PIPELINE_CONFIG_PATH is /home/jupyter/ssds_and_rcnn/lisa/experiments/training/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.config and MODEL_DIR is /home/jupyter/ssds_and_rcnn/lisa/experiments/training/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint.

Contents of faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint:

total 182M
-rw-r----- 1 jupyter jupyter  166 Jul 10 03:56 checkpoint
-rw-r----- 1 jupyter jupyter 182M Jul 10 03:58 ckpt-0.data-00000-of-00001
-rw-r----- 1 jupyter jupyter 8.7K Jul 10 03:56 ckpt-0.index

Now, upon launching the training I am getting:

Traceback (most recent call last):
  File "/home/jupyter/models/research/object_detection/model_main_tf2.py", line 106, in <module>
    tf.compat.v1.app.run()
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/jupyter/models/research/object_detection/model_main_tf2.py", line 103, in main
    use_tpu=FLAGS.use_tpu)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 569, in train_loop
    ckpt.restore(latest_checkpoint)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 2009, in restore
    status = self._saver.restore(save_path=save_path)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1304, in restore
    checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 209, in restore
    restore_ops = trackable._restore_from_checkpoint_position(self)  # pylint: disable=protected-access
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 907, in _restore_from_checkpoint_position
    tensor_saveables, python_saveables))
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 289, in restore_saveables
    validated_saveables).restore(self.save_path_tensor)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 281, in restore
    restore_ops.update(saver.restore(file_prefix))
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 103, in restore
    restored_tensors, restored_shapes=None)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 647, in restore
    for v in self._mirrored_variable.values))
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 647, in <genexpr>
    for v in self._mirrored_variable.values))
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 392, in _assign_on_device
    return variable.assign(tensor)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 846, in assign
    self._shape.assert_is_compatible_with(value_tensor.shape)
  File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/framework/tensor_shape.py", line 1117, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (4,) and (91,) are incompatible

Am I missing something?
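
A note on reading the error: the last frames of the traceback show a model variable of shape (4,) being assigned a checkpoint value of shape (91,). One plausible reading, which is an assumption since the custom config is not reproduced here, is that 91 corresponds to the 90 COCO classes of the base config plus a background class, while 4 corresponds to a smaller custom label set plus background. The sketch below shows how such conflicts could be listed from two name-to-shape mappings. The variable names are hypothetical; in practice the checkpoint side could be filled from `tf.train.list_variables`, which returns (name, shape) pairs.

```python
def find_shape_mismatches(model_shapes, checkpoint_shapes):
    """Return (name, model_shape, checkpoint_shape) for variables whose shapes differ."""
    mismatches = []
    for name, model_shape in sorted(model_shapes.items()):
        ckpt_shape = checkpoint_shapes.get(name)
        # Variables absent from the checkpoint are skipped; only conflicts are reported.
        if ckpt_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches.append((name, tuple(model_shape), tuple(ckpt_shape)))
    return mismatches


# Hypothetical variable names and shapes, chosen to mirror the error above.
model_shapes = {"class_predictor/bias": (4,), "conv1/kernel": (7, 7, 3, 64)}
checkpoint_shapes = {"class_predictor/bias": (91,), "conv1/kernel": (7, 7, 3, 64)}

for name, model_shape, ckpt_shape in find_shape_mismatches(model_shapes, checkpoint_shapes):
    print(f"{name}: model expects {model_shape}, checkpoint holds {ckpt_shape}")
```

A full restore of a checkpoint trained on a different number of classes will hit this kind of conflict on every class-dependent variable, which is why the directory being restored from matters.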

3. Steps to reproduce

If needed I can supply the tfrecords files.

4. Expected behavior

Training should not raise a shape mismatch error, given that I set everything up correctly before launching it.

5. Additional context

None

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device name if the issue happens on a mobile device:
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 2.2.0
  • Python version: Python 3.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: Tesla T4

I received the same error message when I had previously started a run with a wrong config file and forgot to remove the checkpoint from that run. Once I made sure the config was correct and the checkpoint directory was empty, it worked for me.
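
The traceback in the report shows `train_loop` restoring the latest checkpoint at startup, so leftover checkpoint files in the model directory are picked up automatically. A minimal sketch of a pre-flight check, using only the standard library, could look like this. The function name is hypothetical, and the file-name patterns are an assumption based on the directory listing in the report (`checkpoint`, `ckpt-0.index`, `ckpt-0.data-...`).

```python
from pathlib import Path


def saved_checkpoint_files(model_dir):
    """List checkpoint-related file names in model_dir (empty if none or dir is missing)."""
    directory = Path(model_dir)
    if not directory.is_dir():
        return []
    names = []
    for path in sorted(directory.iterdir()):
        # A "checkpoint" state file or any ckpt-* index/data file marks a previous save.
        if path.name == "checkpoint" or path.name.startswith("ckpt-"):
            names.append(path.name)
    return names
```

If the returned list is non-empty before the first training step, the run will try to resume from those files rather than start fresh.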

@cotrane I am actually fine-tuning, so I guess the checkpoint directory MODEL_DIR shouldn't be empty; rather, it should contain the checkpoints from which I am fine-tuning. Please correct me if I am wrong.

I figured out where I went wrong. I should not have set MODEL_DIR to the pre-trained checkpoint path. I provided the absolute path of an empty directory instead, and it worked!
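
To make the separation concrete: the pre-trained weights are referenced by a checkpoint prefix (the path without the `.index` suffix) in the `fine_tune_checkpoint` field of the pipeline config's `train_config`, while `--model_dir` points at a separate directory where the new run writes its own checkpoints. The sketch below derives that prefix from a downloaded checkpoint directory. The function name is hypothetical, and the `ckpt-*.index` pattern is an assumption based on the listing in the report.

```python
from pathlib import Path


def checkpoint_prefix(pretrained_dir):
    """Return the checkpoint prefix (path without the .index suffix) in pretrained_dir."""
    index_files = sorted(Path(pretrained_dir).glob("ckpt-*.index"))
    if not index_files:
        raise FileNotFoundError(f"no ckpt-*.index file in {pretrained_dir}")
    # Take the last name in lexicographic order; with a single ckpt-0 this is unambiguous.
    return str(index_files[-1].with_suffix(""))
```

For the directory listed in the report, the result would end in `checkpoint/ckpt-0`, and that string is what would go into `fine_tune_checkpoint`.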

Thanks!