Rank error in function misc.resolve_shape when trying to use flag upsample_logits

Question

kmkajak opened this issue 2 years ago · 2 comments

I am trying to use the repository to do pose estimation on my own dataset.

First of all, everything seems to work fine in check_train_input.py, train.py, eval.py, and infer.py with the following parameters in params.yml:

#Dataset.
dataset: "sphere"

#Model.
model_variant: "xception_65"
atrous_rates: [12, 24, 36]
encoder_output_stride: 8
decoder_output_stride: [4]
upsample_logits: false
frag_seg_agnostic: false
frag_loc_agnostic: false
num_frags: 64

#Establishing correspondences.
corr_min_obj_conf: 0.1
corr_min_frag_rel_conf: 0.5
corr_project_to_model: false

#Training.
train_tfrecord_names: ["sphere_train-blender"]
train_max_height_before_crop: 128
train_crop_size: "128,128"
optimizer: "AdamOptimizer"
save_interval_steps: 10000
initialize_last_layer: false
fine_tune_batch_norm: false
train_steps: 4500000
train_batch_size: 4
base_learning_rate: 0.0001
obj_cls_loss_weight: 1.0
frag_cls_loss_weight: 1.0
frag_loc_loss_weight: 100.0
train_knn_frags: 1
data_augmentations:
  random_adjust_brightness:
    min_delta: -0.15
    max_delta: 0.15
  random_adjust_contrast:
    min_delta: 0.85
    max_delta: 1.15
  random_adjust_saturation:
    min_delta: 0.85
    max_delta: 1.15
  random_adjust_hue:
    max_delta: 1.0
  random_blur:
    max_sigma: 1.5
  random_gaussian_noise:
    max_sigma: 0.03
  jpeg_artifacts:
    min_quality: 85

However, when I set upsample_logits to true, train.py fails with the following error:

Traceback (most recent call last):
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 928, in merge_with
    self.assert_same_rank(other)
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 982, in assert_same_rank
    raise ValueError("Shapes %s and %s must have the same rank" %
ValueError: Shapes (?, 128, 128) and (?, ?, ?, ?) must have the same rank

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 1013, in with_rank
    return self.merge_with(unknown_shape(rank=rank))
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 934, in merge_with
    raise ValueError("Shapes %s and %s are not compatible" % (self, other))
ValueError: Shapes (?, 128, 128) and (?, ?, ?, ?) are not compatible

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 584, in <module>
    tf.app.run()
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 505, in main
    train_tensor, summary_op = _train_epos_model(
  File "train.py", line 374, in _train_epos_model
    loss = _tower_loss(
  File "train.py", line 285, in _tower_loss
    _build_epos_model(
  File "train.py", line 202, in _build_epos_model
    loss.add_obj_cls_loss(
  File "/home/user/phd/epos/epos_lib/loss.py", line 131, in add_obj_cls_loss
    targets_shape = misc.resolve_shape(targets, 4)[1:3]
  File "/home/user/phd/epos/epos_lib/misc.py", line 44, in resolve_shape
    shape = tensor.get_shape().with_rank(rank).as_list()
  File "/home/user/miniconda/envs/eposaidev/lib/python3.8/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 1015, in with_rank
    raise ValueError("Shape %s must have rank %d" % (self, rank))
ValueError: Shape (?, 128, 128) must have rank 4

I tried multiple sources of data, including the tfrecord file ycbv_test_targets-bop19.tfrecord provided by the authors, so I am fairly confident that the data is not the source of the issue. However, I am not an expert on this repository and have not yet traced the data's entire path through the code up to the point of failure.

Any clues or insights as to what the shapes should be at this point of failure? I appreciate the help.
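The final ValueError comes from the static rank check in misc.resolve_shape, which (per the traceback) calls tensor.get_shape().with_rank(rank).as_list(). The pure-Python sketch below mimics only that check; check_rank and format_shape are hypothetical stand-ins, not TensorFlow or EPOS functions, and shapes of unknown rank are not modelled. It shows that a target of shape (?, 128, 128) has rank 3, so requesting rank 4 fails, while requesting rank 3 and slicing [1:3] yields the spatial size.

```python
from typing import List, Optional, Sequence

Dim = Optional[int]  # None plays the role of an unknown dimension ("?")


def format_shape(shape: Sequence[Dim]) -> str:
    """Render a multi-dimensional shape like TF1 prints it, e.g. (?, 128, 128)."""
    return "(" + ", ".join("?" if d is None else str(d) for d in shape) + ")"


def check_rank(shape: Sequence[Dim], rank: int) -> List[Dim]:
    """Hypothetical stand-in for tensor.get_shape().with_rank(rank).as_list().

    Only shapes of known rank are handled: the shape is returned as a list
    if it has exactly `rank` dimensions, otherwise ValueError is raised.
    """
    if len(shape) != rank:
        raise ValueError("Shape %s must have rank %d" % (format_shape(shape), rank))
    return list(shape)


targets_shape = (None, 128, 128)  # (batch, height, width), as in the error message

try:
    check_rank(targets_shape, 4)  # the rank originally requested in add_obj_cls_loss
except ValueError as err:
    print(err)  # Shape (?, 128, 128) must have rank 4

print(check_rank(targets_shape, 3)[1:3])  # [128, 128], i.e. (height, width)
```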

Answer 1 · 2023-01-02T10:57:36.000Z

Hi @kmkajak, as far as I remember, I always kept upsample_logits set to False, so unfortunately I am not familiar with this issue. Is it necessary to upsample the logits in your case?

Answer 2 · 2023-01-02T11:52:03.000Z

Hi @thodan, in my dataset the object appears over a wide range of distances (anywhere from 5 to 30 meters from the camera), so at long range the target may cover only a few pixels in the output. I was hoping to avoid adding an object detector for this, so I have been experimenting with hyperparameters to see what can be achieved without one. Upsampling the logits to the input image size seems to help with some of that, since post-processing then has more information to work with.
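A rough back-of-the-envelope sketch of why the distance range matters, assuming a pinhole camera (extent in pixels is roughly f * S / Z) and purely hypothetical values: a focal length of 150 px in the resized image and an object 1 m across. The logits are predicted at roughly decoder_output_stride (4 in the params.yml above), so each logit cell covers about 4 x 4 input pixels. object_extent is an illustrative helper, not part of EPOS.

```python
from typing import Tuple


def object_extent(focal_px: float, size_m: float, distance_m: float,
                  stride: int) -> Tuple[float, float]:
    """Approximate object extent in input pixels and in logit cells.

    Uses the pinhole approximation extent = f * S / Z; dividing by the
    output stride gives the extent on the (non-upsampled) logits grid.
    """
    in_px = focal_px * size_m / distance_m
    return in_px, in_px / stride


FOCAL_PX = 150.0  # hypothetical focal length, in pixels of the resized image
OBJECT_M = 1.0    # hypothetical object size in metres
STRIDE = 4        # decoder_output_stride: [4] in params.yml

for distance_m in (5.0, 30.0):
    in_px, cells = object_extent(FOCAL_PX, OBJECT_M, distance_m, STRIDE)
    print(f"{distance_m:4.0f} m: {in_px:5.1f} px in the image, "
          f"{cells:5.2f} cells in the logits")
```

With these placeholder numbers the object spans about 7.5 logit cells at 5 m but only about 1.25 cells at 30 m, which is the effect described above.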

I have now managed to get upsampling to work by correcting the rank passed to misc.resolve_shape and selecting the appropriate dimensions for resizing. However, I have yet to verify conclusively that the results are correct. It SEEMS to work, and a test training run SEEMS to confirm the changes.

The changes:

In the function add_obj_cls_loss in module loss:

if upsample_logits:
    # Targets have shape (batch, height, width), i.e. rank 3; keep (height, width).
    targets_shape = misc.resolve_shape(targets, 3)[1:3]

In the function add_frag_cls_loss in module loss:

if upsample_logits:
    # Resize over the two spatial dimensions (height, width) only.
    logits = resize_logits(logits, shape[1:3])

In the function add_frag_loc_loss in module loss:

if upsample_logits:
    # Resize over the two spatial dimensions (height, width) only.
    logits = resize_logits(logits, shape[1:3])

In conclusion, add_obj_cls_loss specified the wrong rank (originally 4, now 3; the targets have shape (?, 128, 128), as the error message shows), and all three loss functions previously selected too many dimensions when resizing the feature maps.
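To make the shape bookkeeping concrete, here is a pure-Python sketch, assuming the logits are NHWC, i.e. (batch, height, width, channels), and the targets are (batch, height, width) as the error message shows. In TF1, tf.image.resize_bilinear takes 4-D images and a size of exactly two elements (new_height, new_width), so slicing [1:3] from a batch-first shape gives the right size for either layout. resize_size and upsampled_logits_shape are hypothetical helpers that only mimic this arithmetic; what resize_logits in EPOS does internally is not shown in this thread, and the channel count below is a placeholder.

```python
from typing import Optional, Sequence, Tuple

Dim = Optional[int]  # None stands for an unknown dimension ("?")


def resize_size(shape: Sequence[Dim]) -> Tuple[Dim, ...]:
    """Pick the 2-element (height, width) size for a resize call.

    For a batch-first shape, height and width sit at indices 1 and 2
    whether the tensor is (batch, H, W) or (batch, H, W, channels).
    """
    size = tuple(shape[1:3])
    if len(size) != 2:
        raise ValueError("need at least (batch, height, width), got %r" % (tuple(shape),))
    return size


def upsampled_logits_shape(logits_shape: Sequence[Dim],
                           size: Sequence[Dim]) -> Tuple[Dim, ...]:
    """Shape of NHWC logits after resizing them spatially to `size`.

    Mirrors the contract of TF1's resize_bilinear: `size` must hold exactly
    (new_height, new_width); batch and channel dimensions are unchanged.
    """
    if len(logits_shape) != 4:
        raise ValueError("logits must be (batch, height, width, channels)")
    if len(size) != 2:
        raise ValueError("size must be (new_height, new_width), got %d values" % len(size))
    batch, _, _, channels = logits_shape
    return (batch, size[0], size[1], channels)


targets_shape = (None, 128, 128)   # from the error message
logits_shape = (None, 32, 32, 8)   # hypothetical: stride-4 logits, 8 channels

size = resize_size(targets_shape)
print(size)                                        # (128, 128)
print(upsampled_logits_shape(logits_shape, size))  # (None, 128, 128, 8)

# Passing three dimensions as the size is rejected, as the resize op would:
try:
    upsampled_logits_shape(logits_shape, logits_shape[1:])
except ValueError as err:
    print(err)
```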