Question about training on a custom dataset
Hello, thank you for sharing such an amazing code repository. I encountered a few issues while using this repository to train my own dataset. My training set has ten thousand images, with only one target per image.
- Does the clip_max_norm in the training parameters need to be adjusted? During training, the gradient magnitude displayed by my code ranges from 12 to 20. Is this a problem?
- When I tested, I found that the rotation and translation errors were 69 and 0.59 respectively. In my own labels, the rotation target is a 3x3 matrix and the translation is a 1x3 matrix. When checking the model, I found that the dimensions for rotation and translation were set to 3 and 6. Do I need to adjust this?
- I noticed that the backbone in the code is not trainable. Can I make the backbone trainable by passing the lr_backbone parameter?
Looking forward to hearing from you.
Best regards,
Hello @majx1997!
Thank you for your interest in PoET.
- Are the gradient norms you observe measured with or without clipping? In general, clipping ensures that the network weights do not receive huge updates after a training batch. While this can increase training duration, it generally leads to better convergence. You can certainly tune this parameter by observing the typical gradient norm and then choosing an appropriate value. Pascanu et al. discuss this in their paper. Personally, I never had to adjust this parameter to make my network train.
- It is hard to tell from the average errors alone how well the network performs. How long did you train? And what is the typical distance between the camera and the objects in your dataset? The rotation error might also be due to rotational symmetries of your object. Regarding the translation and rotation dimensions, you do not have to adjust anything; you just need to make sure that the dimensions match during the loss calculation. For each image in the batch and each object query, PoET returns a 3-dimensional translation vector, i.e. a tensor of shape (bs, n_q, 3). The 6D rotation representation is used only because of its favorable training properties; at the end of the forward pass it is transformed into a 3x3 rotation matrix. PoET therefore returns the rotation, similar to the translation, as (bs, n_q, 3, 3). You simply need to ensure that the dimensions and associations match in the loss calculation.
- In theory, setting the backbone learning rate to a value larger than 0 makes the backbone trainable. However, the current loss functions do not ensure that the backbone learns the object detection task. Hence, making the backbone trainable will only adapt its weights towards the 6D pose estimation task; it will not improve the detection of objects of interest in the image. For that, you would need to add additional loss functions for object detection.
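To make the gradient clipping point more concrete, here is a minimal NumPy sketch of clipping by global norm. It is not taken from the PoET code. The function name `clip_grad_norm` and the value `max_norm=0.1` are illustrative assumptions. The sketch returns the norm measured before scaling, which is also what PyTorch's `torch.nn.utils.clip_grad_norm_` returns, so a logged value of 12 to 20 may well be the unclipped norm.

```python
import numpy as np

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale gradient arrays in place so their global L2 norm is at most max_norm.

    Returns the global norm measured BEFORE clipping.
    (Hypothetical helper for illustration, not part of PoET.)
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Only shrink gradients; never scale them up.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    for g in grads:
        g *= clip_coef
    return total_norm

# Toy gradients with global norm sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
norm_before = clip_grad_norm(grads, max_norm=0.1)  # max_norm is an assumed value
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
print(norm_before, norm_after)  # roughly 13.0 and 0.1
```

If the value printed during training is the returned norm, the applied update is still bounded by the clipping threshold even though the displayed number looks large.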
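For the rotation and translation shapes, the following NumPy sketch may help. It shows one common way to map a 6D rotation representation to a 3x3 matrix (Gram-Schmidt orthogonalisation) and how labels given as a 3x3 matrix and a 1x3 vector line up with outputs of shape (bs, n_q, 3, 3) and (bs, n_q, 3). The ordering of the six values, the function name, the random stand-in predictions, and the error metrics in degrees are assumptions for illustration; PoET's actual implementation may differ.

```python
import numpy as np

def rotation_6d_to_matrix(x):
    """Map (..., 6) vectors to (..., 3, 3) rotation matrices via Gram-Schmidt.

    Assumption: the first three entries form the first basis vector, the
    last three the second; the resulting vectors are stacked as columns.
    """
    a1, a2 = x[..., :3], x[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)

bs, n_q = 2, 5  # hypothetical batch size and number of object queries
rng = np.random.default_rng(0)
pred_rot = rotation_6d_to_matrix(rng.normal(size=(bs, n_q, 6)))  # (bs, n_q, 3, 3)
pred_trans = rng.normal(size=(bs, n_q, 3))                        # (bs, n_q, 3)

# One labelled object: 3x3 rotation and 1x3 translation, as in the question
gt_rot = np.eye(3)
gt_trans = np.array([[0.1, -0.2, 1.5]]).reshape(3)  # flatten 1x3 to (3,)

# Compare the label with one matched query (index 0 is arbitrary here)
b, q = 0, 0
trans_err = np.linalg.norm(pred_trans[b, q] - gt_trans)
cos_angle = (np.trace(pred_rot[b, q].T @ gt_rot) - 1.0) / 2.0
rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
print(pred_rot.shape, pred_trans.shape, trans_err, rot_err_deg)
```

The key point is that the label needs no 6D conversion: the 3x3 ground-truth matrix is compared directly with the 3x3 output, and the 1x3 translation only needs to be reshaped to a vector of length 3 and associated with the correct query.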
Hope this helps you! Let me know if you need any additional help!
Best,
Thomas