Evaluation results very bad
Hi, I'm trying to train PoET on my custom dataset. For the sake of clarity: the dataset is composed of 30 videos, each containing a single object that is always roughly in the center of the image (it can be slightly off, but it is approximately centered) and never occluded. As a first step I trained YOLO, and it works very well. However, the PoET transformer is not learning at all during training, whether I use the backbone or the ground-truth labels. If I compare the evaluation metrics between the 5th and the 50th epoch, the results are almost the same (in some cases even worse). Looking at the losses during training, they do decrease over the epochs, yet the results are still very bad. Have you ever experienced something like this?
PS: As hyperparameters I'm using the ones you provided for the YCB-Video dataset, since my dataset is built from custom objects but has a similar configuration.
These are the losses:
{"train_lr": 0.0001999999999999853, "train_grad_norm": 71.56180365573567, "train_position_loss": 0.0004068389578922774, "train_rotation_loss": 0.7290082707034702, "train_loss": 3.76163222109715, "train_loss_trans": 0.0008136779157845548, "train_loss_rot": 0.7290082707034702, "train_loss_trans_0": 0.0006544853708065885, "train_loss_rot_0": 0.7750758395832922, "train_loss_trans_1": 0.0008348826280783751, "train_loss_rot_1": 0.7659609740001402, "train_loss_trans_2": 0.0009145029361920533, "train_loss_rot_2": 0.74827361520032, "train_loss_trans_3": 0.000919925991882703, "train_loss_rot_3": 0.7391760488487178, "train_loss_trans_unscaled": 0.0004068389578922774, "train_loss_rot_unscaled": 0.7290082707034702, "train_loss_trans_0_unscaled": 0.00032724268540329423, "train_loss_rot_0_unscaled": 0.7750758395832922, "train_loss_trans_1_unscaled": 0.00041744131403918756, "train_loss_rot_1_unscaled": 0.7659609740001402, "train_loss_trans_2_unscaled": 0.00045725146809602667, "train_loss_rot_2_unscaled": 0.74827361520032, "train_loss_trans_3_unscaled": 0.0004599629959413515, "train_loss_rot_3_unscaled": 0.7391760488487178, "epoch": 10, "n_parameters": 14047113}
{"train_lr": 0.0001999999999999853, "train_grad_norm": 49.230578642773246, "train_position_loss": 0.0002832899802765488, "train_rotation_loss": 0.3247749454875207, "train_loss": 1.772507746909583, "train_loss_trans": 0.0005665799605530976, "train_loss_rot": 0.3247749454875207, "train_loss_trans_0": 0.00048024131988974757, "train_loss_rot_0": 0.3864199985161464, "train_loss_trans_1": 0.0006219867309041784, "train_loss_rot_1": 0.3682782535760578, "train_loss_trans_2": 0.000699654193110162, "train_loss_rot_2": 0.3492575710207973, "train_loss_trans_3": 0.0006888218545745218, "train_loss_rot_3": 0.340719692684924, "train_loss_trans_unscaled": 0.0002832899802765488, "train_loss_rot_unscaled": 0.3247749454875207, "train_loss_trans_0_unscaled": 0.00024012065994487378, "train_loss_rot_0_unscaled": 0.3864199985161464, "train_loss_trans_1_unscaled": 0.0003109933654520892, "train_loss_rot_1_unscaled": 0.3682782535760578, "train_loss_trans_2_unscaled": 0.000349827096555081, "train_loss_rot_2_unscaled": 0.3492575710207973, "train_loss_trans_3_unscaled": 0.0003444109272872609, "train_loss_rot_3_unscaled": 0.340719692684924, "epoch": 40, "n_parameters": 14047113}
These are the corresponding evaluations:
Epoch 10
----------------------------------------------------------------------------------------------------
Metric ADD(-S)
----------------------------------------------------------------------------------------------------
** Unnamed-DUMMY#1-1 ** threshold=[0.0, 0.10], area: 6.23
threshold=0.02, correct poses: 1.0, all poses: 3734.0, accuracy: 0.03
threshold=0.05, correct poses: 146.0, all poses: 3734.0, accuracy: 3.91
threshold=0.10, correct poses: 1271.0, all poses: 3734.0, accuracy: 34.04
----------------------------------------------------------------------------------------------------
Metric Average Rotation Error in Degrees
----------------------------------------------------------------------------------------------------
Class: Unnamed-DUMMY#1-1 134.05327252676838

Epoch 40
----------------------------------------------------------------------------------------------------
Metric ADD(-S)
----------------------------------------------------------------------------------------------------
** Unnamed-DUMMY#1-1 ** threshold=[0.0, 0.10], area: 5.83
threshold=0.02, correct poses: 16.0, all poses: 3734.0, accuracy: 0.43
threshold=0.05, correct poses: 135.0, all poses: 3734.0, accuracy: 3.62
threshold=0.10, correct poses: 1219.0, all poses: 3734.0, accuracy: 32.65
----------------------------------------------------------------------------------------------------
Metric Average Rotation Error in Degrees
----------------------------------------------------------------------------------------------------
Class: Unnamed-DUMMY#1-1 136.038801572181
The meaningfulness of the ADD/ADD-S metric heavily depends on the dataset you are working with. If you look at the YCB-V dataset, the distance to the object lies within 60 cm to 1 m. In this case the ADD-S metric is meaningful, as its upper limit for the error calculation is 10 cm. However, if the objects in your dataset are 2 m to 3 m away, then an error of 20 cm might still be acceptable. In general, I would advise looking at the average translation and rotation error to get a feeling for how well your network performs. Then, depending on the underlying situation, you can consult the ADD/ADD-S metric.
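(For readers unfamiliar with these metrics, below is a minimal sketch of the standard ADD and rotation-error definitions, not PoET's actual evaluation code. ADD-S differs only in that it matches each point to the closest transformed model point, which handles symmetric objects.)

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between the object's model points transformed
    by the ground-truth pose and by the estimated pose.
    pts: (N, 3) model points in the object frame."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def rotation_error_deg(R_gt, R_est):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# A pose counts as correct when ADD falls below a threshold (e.g. 0.10 m).
# Whether 10 cm is strict or lenient depends entirely on how far away
# the objects are, which is the point made above.
```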
Hope this helps!
Best,
Thomas
Hi, actually I found out that the network was heavily overfitting. I tried changing some parameters, such as nheads and the number of encoder and decoder layers, but it still overfits, so I'm now trying to generate more samples for the dataset.
Do you have any advice on how to make the dataset more suitable for PoET?
Could you provide a general overview of your dataset? How many images, how many objects, etc.?
Well, up to now I have tried to train it on approximately 100k images, each containing only one object (I assumed it would be useful to start with a very simple case). I trained PoET using the ground-truth mode. Each frame has a background randomly sampled from 30 images, and the object pose is completely random.
I think the main problem could be that having only one object to regress is too simple, so today I'll generate a new dataset with 14 objects, such that each frame contains from 3 up to 7 objects.
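(A minimal sketch of the compositing scheme described above: a random background from a small pool, with 3 to 7 pre-rendered RGBA object crops pasted per frame. The directory names, file layout, and frame size are all hypothetical.)

```python
import random
from pathlib import Path
from PIL import Image

# Hypothetical layout: 30 background images and a pool of pre-rendered
# RGBA object crops (one file per object/pose sample).
backgrounds = list(Path("backgrounds").glob("*.png"))
object_crops = list(Path("object_crops").glob("*.png"))

def compose_frame(size=(640, 480)):
    """Paste 3-7 randomly chosen object crops onto a random background."""
    frame = Image.open(random.choice(backgrounds)).convert("RGB").resize(size)
    for crop_path in random.sample(object_crops, k=random.randint(3, 7)):
        crop = Image.open(crop_path).convert("RGBA")
        # Random placement; assumes crops are smaller than the frame.
        x = random.randint(0, size[0] - crop.width)
        y = random.randint(0, size[1] - crop.height)
        frame.paste(crop, (x, y), crop)  # alpha channel used as paste mask
    return frame
```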