liyi14/mx-DeepIM

AssertionError: wx and wy should be equal

eyildiz-ugoe opened this issue · 18 comments

During the training session it throws the following error:

Epoch[0] Batch [140]	Speed: 5.32 samples/sec	Train-Flow_L2Loss=0.000000,	Flow_CurLoss=0.000000,	PointMatchingLoss=15.519974,	MaskLoss=1.911539,	
Epoch[0] Batch [160]	Speed: 5.50 samples/sec	Train-Flow_L2Loss=0.000000,	Flow_CurLoss=0.000000,	PointMatchingLoss=15.912843,	MaskLoss=1.730901,	
Epoch[0] Batch [180]	Speed: 5.55 samples/sec	Train-Flow_L2Loss=0.000000,	Flow_CurLoss=0.000000,	PointMatchingLoss=16.262645,	MaskLoss=1.588497,	
batch 200: lr: 0.0001
Epoch[0] Batch [200]	Speed: 5.54 samples/sec	Train-Flow_L2Loss=0.000000,	Flow_CurLoss=0.000000,	PointMatchingLoss=16.708019,	MaskLoss=1.470834,	
Epoch[0] Batch [220]	Speed: 5.55 samples/sec	Train-Flow_L2Loss=0.000000,	Flow_CurLoss=0.000000,	PointMatchingLoss=19.187362,	MaskLoss=1.543989,	
Error in CustomOp.forward: Traceback (most recent call last):
  File "/home/username/.local/lib/python2.7/site-packages/mxnet/operator.py", line 987, in forward_entry
    aux=tensors[4])
  File "experiments/deepim/../../deepim/operator_py/zoom_flow.py", line 60, in forward
    assert wx == wy, 'wx and wy should be equal'
AssertionError: wx and wy should be equal

terminate called after throwing an instance of 'dmlc::Error'
  what():  [18:43:36] src/operator/custom/custom.cc:347: Check failed: reinterpret_cast<CustomOpFBFunc>( params.info->callbacks[kCustomOpForward])( ptrs.size(), const_cast<void**>(ptrs.data()), const_cast<int*>(tags.data()), reinterpret_cast<const int*>(req.data()), static_cast<int>(ctx.is_train), params.info->contexts[kCustomOpForward]) 

Stack trace returned 8 entries:
[bt] (0) /home/username/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x37b172) [0x7f1d99d47172]
[bt] (1) /home/username/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x37b738) [0x7f1d99d47738]
[bt] (2) /home/username/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x56dcd1) [0x7f1d99f39cd1]
[bt] (3) /home/username/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x5885c1) [0x7f1d99f545c1]
[bt] (4) /home/username/.local/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x56ebc6) [0x7f1d99f3abc6]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd8f0) [0x7f1e225cc8f0]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f1e26b606db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f1e26e9988f]


Aborted (core dumped)

Any idea what's going on?

I'm facing the same problem when training the model. Were you able to fix the problem @eyildiz-ugoe @liyi14 @wangg12 ?

I think it is because the learning rate you set is too high, or because you removed the flow loss; either can make the predicted pose so wrong that the rendered object falls outside the current image. To validate this, you can add an assert here
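For example, something along these lines (just a sketch, not the actual code in zoom_flow.py; it assumes the rendered object mask is available as a numpy array):

import numpy as np

def mask_to_bbox(mask):
    # Tight (x1, y1, x2, y2) box of a binary mask; fails loudly instead of
    # silently producing an empty or degenerate box when the object has
    # left the image.
    ys, xs = np.where(mask > 0)
    assert xs.size > 0, 'rendered object is completely outside the image (empty mask)'
    return xs.min(), ys.min(), xs.max(), ys.max()

If this assert fires early in training, the rendered pose is already far off the image.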

Thanks a lot for the reply.
The weird thing is that I did not change anything in the reference implementation yet.

For some reason the flow loss does not decrease during training. Should I lower the pre-set learning rate?

@Cryptiex Have you checked your loaded data, e.g. via visualization?
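A minimal way to eyeball a loaded sample could look like this (a generic sketch, assuming the loader returns the image and the object mask as numpy arrays; the names are made up):

import numpy as np
import matplotlib.pyplot as plt

def show_sample(image, mask):
    # Overlay the object mask in red on the input image so a misaligned or
    # empty mask is immediately visible.
    overlay = image.astype(np.float32)
    overlay[mask > 0] = 0.5 * overlay[mask > 0] + 0.5 * np.array([255.0, 0.0, 0.0])
    plt.imshow(overlay.astype(np.uint8))
    plt.title('loaded image with object mask overlay')
    plt.show()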


Did you solve the problem yet?



You mean that if the predicted pose is bad enough, the rendered object ends up outside the image?

But,

  • How should we constrain the rendering result under an arbitrary pose so that it does not fall outside the image coordinate range?
  • If it is inevitable that the rendered object sometimes falls outside the image coordinate range, how should we deal with it?

You can try to compute the 2D object bounding box without clipping it within the image size.


I tried decreasing the learning rate from 1e-4 to 5e-5, but it did not help.


Thanks for replying.

Could you please describe in more detail how to solve it? Forgive me, I am not yet familiar with the implementation details of DeepIM, but I am really interested in it.

I guess you can check the loaded data through visualization.
To get the flow to work, you also need to check whether the gpu_flow calculator is working as expected (run the test flow script); if not, you can simply disable it.
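One generic way to eyeball a flow field, assuming it comes back as an H x W x 2 array (the layout in the repo may differ, e.g. 2 x H x W):

import numpy as np
import matplotlib.pyplot as plt

def plot_flow(flow, step=16):
    # Sparse quiver plot of a dense flow field so obviously broken flow
    # (all zeros, wrong direction) is easy to spot.
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    u = flow[ys, xs, 0]
    v = flow[ys, xs, 1]
    plt.quiver(xs, ys, u, v, angles='xy', scale_units='xy', scale=1, color='r')
    plt.gca().invert_yaxis()  # image coordinates: y grows downward
    plt.title('gpu_flow sanity check')
    plt.show()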

Following your advice, I ran the test flow script (flow.py) and plotted the gpu_flow result as shown below. Since the corresponding color images are not available from gt_observed, I commented out the code for plotting the src and tgt images. So the gpu_flow calculator works, right?

[Figure_1: plot of the computed gpu_flow field]

Yes. Then you might need to debug into the data loader and zooming operations.

Hi, the AssertionError: wx and wy should be equal might result from rendering with a wrong pose. I mean the rendered result might fall outside the image coordinate range.

My question is: if rendering outside the image is inevitable during training, what should I do?


As I said, you can compute the 2D object bbox without clipping it within the image size.

In this implementation, the box is obtained from the mask, so it is always clipped to the image size and may end up empty.
However, you can directly get the box through projection.
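A rough sketch of that idea (points_3d, pose and K are placeholder names for the model points, the 3x4 [R|t] pose and the camera intrinsics, not the repo's actual variables):

import numpy as np

def bbox_from_projection(points_3d, pose, K):
    # Project the Nx3 model points with pose (3x4 [R|t]) and intrinsics K,
    # and return the unclipped 2D box (x1, y1, x2, y2). The box may extend
    # beyond the image, but it is never empty.
    pts_cam = np.dot(pose[:, :3], points_3d.T) + pose[:, 3:4]  # 3 x N camera coords
    uv = np.dot(K, pts_cam)
    uv = uv[:2] / uv[2:3]  # 2 x N pixel coords
    return uv[0].min(), uv[1].min(), uv[0].max(), uv[1].max()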


Hi.

It seems that we have found a solution.

As you said before, the rendered object was so wrong that it fell outside the current image. To avoid this, we changed the training iteration number from 4 to 2 to make training easier and more stable, and obtained an initial pre-trained model. Based on this pre-trained model, we fine-tuned with the iteration number changed back to 4.

Glad you have solved it. BTW, a warmup strategy for the number of iterations could also do the trick.
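For example, a minimal sketch of such a warmup (the epoch threshold and the names are only illustrative, not an existing config option):

def iter_schedule(epoch, warmup_epochs=2, warmup_iters=2, full_iters=4):
    # Use fewer refinement iterations while the network is still weak,
    # then switch to the full number once the poses are roughly right.
    return warmup_iters if epoch < warmup_epochs else full_iters

# e.g. pick the iteration count at the start of each epoch:
# n_iter = iter_schedule(cur_epoch)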