nianticlabs/mickey

Training on custom dataset

Closed this issue · 6 comments

Hi, thanks for the great work!

I am trying to train MicKey on custom large-scale datasets, such as RealEstate10K. I have implemented the dataloader, but I cannot tell whether training is working correctly just from the loss graphs or the depth maps shown during training. Specifically, I do not see a monotonic decrease in the error metrics, as shown in the screenshot below.
image

I have also tried training MicKey on the sample Map-free dataset, and it showed a similar training signal. Are there any signals I can use to verify that training is working correctly?

Thanks!

Hello, thanks for your interest in our work!

What loss function are you using? With VCRE (with soft clipping), the loss should start at 1.0 and decrease to 0.3-0.2 by the end of training. In the config file, you should see the following under the LOSS_CLASS section:
LOSS_FUNCTION: "VCRE"
SOFT_CLIPPING: True
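
Because REINFORCE-style losses are noisy, it can help to smooth the logged values before comparing them with the expected range. The following is a minimal sketch, not part of the MicKey codebase: the function names, the smoothing factor, and the tolerance are all assumptions chosen for illustration, and the only values taken from this thread are the 1.0 starting point and the 0.3 upper end of the final range.

```python
def ema(values, alpha=0.05):
    """Exponential moving average of a sequence of logged loss values.

    `alpha` is a hypothetical smoothing factor, not a MicKey setting.
    """
    smoothed = []
    avg = None
    for v in values:
        # The first value seeds the average; later values are blended in.
        avg = v if avg is None else (1 - alpha) * avg + alpha * v
        smoothed.append(avg)
    return smoothed


def looks_converging(losses, start=1.0, target=0.3, alpha=0.05, tol=0.15):
    """Rough check that the smoothed loss starts near `start` and ends at or below `target`.

    `tol` is an assumed tolerance on the starting value.
    """
    s = ema(losses, alpha)
    return abs(s[0] - start) <= tol and s[-1] <= target


if __name__ == "__main__":
    # Synthetic curve that decays linearly from 1.0 to 0.2.
    healthy = [1.0 - 0.8 * i / 999 for i in range(1000)]
    # Flat curve that is already tiny at step 0, which is a suspicious pattern.
    suspicious = [1e-3] * 1000
    print(looks_converging(healthy), looks_converging(suspicious))
```

A curve that is already far below 1.0 at the first step fails this check, which is a hint to inspect the data or the configuration rather than a sign of fast convergence.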

One problem with the REINFORCE pipeline in MicKey is that it is slow to train. We have just pushed a new RANSAC that speeds up the training pipeline. We have also added two configurations (curriculum_learning_warm_up.yaml and overlap_score_warm_up.yaml) that use low-resolution images and do not include the null hypothesis; both changes should improve convergence speed.

As a reference, see the training curves when using overlap_score_warm_up.yaml:
Screenshot 2024-09-19 at 09 50 17

I would suggest using the latest version and checking whether you can replicate the training curves for Map-free. I am happy to help debug why RealEstate10K is not working after this sanity check.

Thanks!

Hi, thanks for the quick reply!
I have tried the latest version, but the loss is very small from the beginning (around 1e-3), while loss_rot is about 0.3 and loss_trans is about 1.5. Could this be due to the large batch size? Instead of using 4 GPUs, I am training on 1 GPU with 4 times the batch_size given in the configuration file.

Hi, thanks for trying the new version so quickly!

Are you training with the Map-free data? You could use this config file for now to verify that the pipeline works as expected. Set the number of GPUs to 1 (the training curves I shared above use only 1 GPU and a batch size of 24).

TensorBoard also stores keypoints, depths, and matches. Are you able to visualise any of those to see whether the network is generating something meaningful? If the loss is that low, I would expect something like the figure below:
Screenshot 2024-09-19 at 15 46 41

I am actually training on the RealEstate10K dataset. I suspect this may be because the sampled image pairs have large overlaps, which makes the problem too easy. The correspondences are not perfect but are nearly correct from the beginning, while the depth is not being learned.
Visualizations from the early steps of training are shown below.
image
image

Thank you very much for the help!

Hello, I think I know what the problem is: MicKey expects a metric dataset for training. As far as I know (please do correct me), RealEstate10K is an unscaled dataset, so it cannot be trained with our REINFORCE pipeline out of the box.

One option could be to switch to pose supervision in the config file and then use only the rotation loss (not the translation loss). Hope this helps; please do reopen the issue if you have further questions.
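
To illustrate why the rotation loss remains usable on an unscaled dataset while the translation loss does not, here is a small self-contained sketch. It is not MicKey's loss code: the helper names and the example poses are hypothetical, and the error measures (geodesic rotation angle, Euclidean translation distance) are generic choices made for this illustration. It shows that rescaling the ground-truth translation by an arbitrary factor changes the translation error but leaves the rotation error untouched.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    """Transpose of a 3x3 matrix given as nested lists."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two rotation matrices."""
    R = matmul(transpose(R_est), R_gt)
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to the valid acos domain to guard against rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors."""
    return math.dist(t_est, t_gt)


if __name__ == "__main__":
    R_gt, R_est = rot_z(30.0), rot_z(33.0)
    t_est = (1.0, 0.0, 0.0)  # hypothetical prediction in metres
    for scale in (0.5, 1.0, 2.0):
        # An unscaled dataset fixes the translation only up to `scale`.
        t_gt = (1.0 * scale, 0.0, 0.0)
        print(scale, rotation_error_deg(R_est, R_gt), translation_error(t_est, t_gt))
```

The rotation error stays the same for every scale factor, whereas the translation error depends entirely on which arbitrary scale the dataset happens to use.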

Hi, thank you very much for the kind explanation!
I do think that could be the main problem! I will try training with only the rotation loss.
Thank you again for the great work and kind help!

Best regards,
Jaewoo.