pvnieo/SURFMNet-pytorch

Loss goes to nan after 6 iterations

hearables-pkinsella opened this issue · 7 comments

I'm following through your examples and am having issues when I start training.

It starts off fine but after 6 iterations the loss turns to nan. Do you have any guidance on what could be causing this?
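A generic first step for this kind of problem, sketched here in plain Python rather than with this repository's code, is to find the first iteration with a non-finite loss so that the batch feeding it can be inspected. The helper name `first_non_finite` and the sample loss values are hypothetical and are not taken from the actual run.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical loss history (illustrative values only): finite, then NaN.
history = [100.0, 80.5, 61.2, 40.0, 33.3, 20.1, float("nan")]
bad_step = first_non_finite(history)
```

The same `math.isfinite` check can be applied to the preprocessed inputs themselves, which is worth trying when the loss breaks after only a few iterations.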

Hi @hearables-pkinsella ,

When did you clone the repo?

I cloned it yesterday. After reprocessing the training data, the NaN issue went away. The loss decreases from 100 to about 9, but I had an issue where it then spiked back up to 60 after 5 epochs. I ran some tests on the results, and it's not working when compared to the TensorFlow implementation. Do you have a sample test script that you use?

Did you train it using the same hyperparameters and number of epochs as the TensorFlow version? How did you evaluate the TensorFlow version? Normally, this should be the same as in PyTorch.

Also, an unstable loss is normal for this network!

That's good to know that an unstable loss is normal. I originally tested with your default parameters and then with the parameters from the original paper, but the output is still the same.

I am evaluating visually by generating a P2P map and looking at the correspondences. Our database is made up of shapes with quite smooth curvature that are all in the same coordinate system. I have attached a sample of our dataset. If you have time, I would appreciate any feedback on tuning the parameters.

3105_R.zip

So it turns out that increasing the radius for the SHOT descriptor was the key to getting better results.

I noticed one issue that may be in my test script or in the model itself: if I set the torch model to eval, the output is just the same number, whereas if I set it to train, I get the correct correspondences out.

What batch size are you using during training?
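The thread does not establish the cause of the eval/train difference. One general mechanism that makes batch size relevant is batch normalization: in train mode it normalizes with the statistics of the current batch, while in eval mode it uses running averages accumulated during training. Whether this model's behaviour comes from such a layer is an assumption here. The framework-free sketch below uses a hypothetical `TinyBatchNorm` class (a simplification, not PyTorch's implementation) to show how far the two modes can diverge when the running averages have not converged.

```python
import math


class TinyBatchNorm:
    """Simplified 1-D batch normalization over a list of floats.

    Hypothetical teaching class: no learnable scale/shift, and the running
    variance uses the biased batch variance for simplicity.
    """

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Train mode: normalize with the statistics of this batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Move the running averages a small step toward the batch statistics.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: normalize with the accumulated running averages.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = TinyBatchNorm()
batch = [1000.0, 1002.0, 1004.0]  # illustrative values far from the initial running mean

train_out = bn(batch)   # centred around 0 using batch statistics
bn.training = False
eval_out = bn(batch)    # running averages lag far behind after a single update
```

After a single training step the eval-mode outputs here are all large and nearly equal, so any saturating layer that follows would map them to almost the same value. Training for longer lets the running averages catch up.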

Closing the issue!
Feel free to reopen it if you have further questions!