object_detection Shape mismatch after adjusting the pipeline config, anything missing?
Closed this issue · 3 comments
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am using the latest TensorFlow Model Garden release and TensorFlow 2.
- I am reporting the issue to the correct repository. (Model Garden official or research directory)
- I checked to make sure that this issue has not already been filed.
1. The entire URL of the file you are using
Base config file: https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/configs/tf2/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.config
Base checkpoints: http://download.tensorflow.org/models/object_detection/tf2/20200711/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.tar.gz
2. Describe the bug
This is related to fine-tuning a detection model; in my case, it is based on Faster R-CNN. I have changed all the necessary parameters in the configuration file to match my dataset. The config file can be found here.
I am launching the training locally with the following command:
python ~/models/research/object_detection/model_main_tf2.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--alsologtostderr
Here, PIPELINE_CONFIG_PATH is defined as home/jupyter/ssds_and_rcnn/lisa/experiments/training/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8.config and MODEL_DIR is defined as /home/jupyter/ssds_and_rcnn/lisa/experiments/training/faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint.
Contents of faster_rcnn_resnet101_v1_640x640_coco17_tpu-8/checkpoint:
total 182M
-rw-r----- 1 jupyter jupyter 166 Jul 10 03:56 checkpoint
-rw-r----- 1 jupyter jupyter 182M Jul 10 03:58 ckpt-0.data-00000-of-00001
-rw-r----- 1 jupyter jupyter 8.7K Jul 10 03:56 ckpt-0.index
Now, upon launching the training I am getting:
Traceback (most recent call last):
File "/home/jupyter/models/research/object_detection/model_main_tf2.py", line 106, in <module>
tf.compat.v1.app.run()
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/jupyter/models/research/object_detection/model_main_tf2.py", line 103, in main
use_tpu=FLAGS.use_tpu)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/object_detection/model_lib_v2.py", line 569, in train_loop
ckpt.restore(latest_checkpoint)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 2009, in restore
status = self._saver.restore(save_path=save_path)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1304, in restore
checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 209, in restore
restore_ops = trackable._restore_from_checkpoint_position(self) # pylint: disable=protected-access
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 907, in _restore_from_checkpoint_position
tensor_saveables, python_saveables))
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 289, in restore_saveables
validated_saveables).restore(self.save_path_tensor)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 281, in restore
restore_ops.update(saver.restore(file_prefix))
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 103, in restore
restored_tensors, restored_shapes=None)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 647, in restore
for v in self._mirrored_variable.values))
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 647, in <genexpr>
for v in self._mirrored_variable.values))
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/distribute/values.py", line 392, in _assign_on_device
return variable.assign(tensor)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 846, in assign
self._shape.assert_is_compatible_with(value_tensor.shape)
File "/home/jupyter/.local/bin/.virtualenvs/tfod_api/lib/python3.7/site-packages/tensorflow/python/framework/tensor_shape.py", line 1117, in assert_is_compatible_with
raise ValueError("Shapes %s and %s are incompatible" % (self, other))
**ValueError: Shapes (4,) and (91,) are incompatible**
Am I missing something?
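One plausible reading of the two shapes in the error, assuming the mismatched variable belongs to the second-stage class predictor: Faster R-CNN scores every class plus one background slot, so a checkpoint trained with num_classes: 90 (the value in the COCO base config) would store a vector of length 91, while a model built for 3 classes would expect length 4. The 3-class figure is inferred from the error message and is not stated in this issue, and the helper name below is hypothetical.

```python
def class_vector_length(num_classes):
    """Length of a per-class vector that reserves one extra slot for background."""
    return num_classes + 1


# Assumed values: 90 comes from the COCO base config, 3 is inferred from the error.
checkpoint_len = class_vector_length(90)
model_len = class_vector_length(3)
print(f"checkpoint: ({checkpoint_len},)  model: ({model_len},)")
```

If this reading is right, the restore step is comparing a variable from the custom model against the COCO checkpoint one-to-one, which only makes sense when the checkpoint was produced by the same model configuration.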
3. Steps to reproduce
If needed I can supply the tfrecords files.
4. Expected behavior
The training should not raise the shape mismatch error given I have done everything correctly before launching the training.
5. Additional context
None
6. System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Mobile device name if the issue happens on a mobile device:
- TensorFlow installed from (source or binary): Binary
- TensorFlow version (use command below): 2.2.0
- Python version: Python 3.7
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: 10.1
- GPU model and memory: Tesla T4
I received the same error message when I had previously started a run with a wrong config file and forgot to remove the checkpoint from that run. Once I made sure the config was correct and the checkpoint directory was empty, it worked for me.
@cotrane I am actually fine-tuning, so I guess the checkpoint directory MODEL_DIR shouldn't be empty; rather, it should contain the checkpoints from which I am fine-tuning. Please correct me if I am wrong.
I figured out where I went wrong. I should not have pointed MODEL_DIR at the pre-trained checkpoint path. I just provided the absolute path of an empty directory and it worked!
Thanks!
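The resolution can be turned into a small pre-flight check. The traceback shows train_loop calling ckpt.restore(latest_checkpoint), which suggests the training loop resumes from whatever checkpoint it finds in MODEL_DIR. The sketch below uses only the Python standard library to list checkpoint files already sitting in that directory before a new run is launched. The function name find_checkpoint_files and the file-name patterns are assumptions for illustration (they mirror the directory listing in this issue) and are not part of the Object Detection API. The pre-trained weights would normally be referenced from the fine_tune_checkpoint field of train_config in the pipeline config rather than placed in MODEL_DIR.

```python
from pathlib import Path


def find_checkpoint_files(model_dir):
    """Return sorted names of checkpoint-like files directly inside model_dir.

    Hypothetical helper: the patterns match the names seen in this issue's
    directory listing (checkpoint, ckpt-0.index, ckpt-0.data-00000-of-00001).
    """
    path = Path(model_dir)
    if not path.is_dir():
        # A directory that does not exist yet cannot hold stale checkpoints.
        return []
    found = set()
    for pattern in ("checkpoint", "ckpt-*"):
        found.update(p.name for p in path.glob(pattern) if p.is_file())
    return sorted(found)


def check_model_dir(model_dir):
    """Raise if model_dir already contains checkpoints from another run."""
    stale = find_checkpoint_files(model_dir)
    if stale:
        raise RuntimeError(
            f"{model_dir} already contains checkpoint files: {stale}. "
            "Use an empty directory for a fresh fine-tuning run."
        )
```

Calling check_model_dir on the intended MODEL_DIR before launching model_main_tf2.py would have flagged the three files in the listing above; an empty or not-yet-created directory passes silently.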