j96w/DenseFusion

Extremely bad overfit with real camera

Closed this issue · 20 comments

I have trained both with and without noise, and the following happens:
When there is no noise, I get good training and test loss, but when I then apply the model to a new video, the pose estimate (especially the orientation) is completely off. The only case where it works is when the new video is really similar to a training video.

I then trained using noise, but with noise (tried 3, 2, 1 and 0.5 cm) the test loss remains extremely high (0.5 m for an object of 0.05 cm). Any idea why this happens?

I am training with 1000 images and testing with 4000 thousand images. I am sampling 1000 points, and the training videos were taken from different angles.

Currently I am training with the w parameter equal to 0.1. I suppose the network was able to get a "low" loss in training because it was assigning low confidence to its estimates.

j96w commented

Hi,

Could you make sure that the range of the object translation is close to the distribution of the YCB dataset? It won't work if the objects in your new video are far outside the translation range of the training set, since we are using regression training. You also need to double-check the camera intrinsic matrix of your new video.

Also, for your own dataset, you probably need to tune the w parameter so that the confidence score can reach a reasonable level (close to 1.0). Otherwise, the regression training of the pose just doesn't start.
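Roughly, w enters the training objective like this (a simplified sketch of the confidence-weighted loss described in the paper, not the repo's actual training code):

```python
import torch

def confidence_weighted_loss(per_point_dist, confidence, w=0.1):
    # per_point_dist: (N,) ADD/ADD-S distance of each dense prediction
    # confidence:     (N,) predicted confidence c_i in (0, 1)
    # A small w lets the network minimize the loss by predicting tiny
    # confidences instead of accurate poses; a reasonable w pushes the
    # confidences towards 1.0 once the pose regression starts working.
    return torch.mean(per_point_dist * confidence - w * torch.log(confidence))
```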

May I know the reason for your 1k training and 4k (or 4000k?) testing data split? It seems like 1k training samples are not enough for the model to learn much.

Hello,
Thank you for answering.
I am investigating the sampling procedure from the mesh. Until now I was sampling uniformly using the open3d library. Now, after sampling, I am also shuffling the points.

j96w commented

Hi, the mesh is only used to calculate the loss. Actually, there is no need to sample or shuffle the points from the mesh.

Indeed, I didn't get any improvement. I am now trying the following: instead of feeding the video frames in temporal order, I shuffle them so that they come in random order. I hope it helps.
The camera used for testing and training is the same, and the lighting conditions are identical.

It is so weird: on the training dataset the model quickly learns the right pose, but in the testing phase it always predicts a pose that is wrongly rotated about one axis. The x, y, z position is okay.

Still no improvement. I will try the following: add noise to the depth input and replace the zero values of the depth camera with random numbers, so that the network cannot rely on the zeros.
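Something like this (a rough sketch of that depth perturbation, assuming depth stored with a metric scale factor; the function name and defaults are mine, not from the repo):

```python
import numpy as np

def perturb_depth(depth, scale=1000.0, noise_std_m=0.005, rng=None):
    """Add Gaussian noise to a depth map and replace missing (zero) pixels
    with random values, so the network cannot key on exact zeros."""
    rng = np.random.default_rng() if rng is None else rng
    depth = depth.astype(np.float32)
    valid = depth > 0
    noisy = depth + rng.normal(0.0, noise_std_m * scale, size=depth.shape)
    if valid.any():
        lo, hi = depth[valid].min(), depth[valid].max()
        # fill the holes with random depths drawn from the observed range
        noisy[~valid] = rng.uniform(lo, hi, size=int((~valid).sum()))
    return noisy
```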

I also added noise to the depth input and things have improved a bit. The next step I am thinking of is training with the output mask of the instance segmentation network instead of the ground-truth mask.

j96w commented

Hi, I have to say I fail to get your point. Data augmentation is used to avoid overfitting and improve performance. For your case, I would suggest debugging first, especially the rotation error around one axis. Data augmentation won't fix a bug.
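One way to localize that kind of error (a sketch with a hypothetical helper name): decompose the relative rotation between prediction and ground truth into an axis and an angle. If the axis is consistent across frames, it usually points to an object symmetry or a coordinate-frame mismatch rather than a learning problem.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def rotation_error_breakdown(R_pred, R_gt):
    # R_pred, R_gt: 3x3 rotation matrices
    R_rel = R_gt.T @ R_pred                      # remaining error rotation
    rotvec = R.from_matrix(R_rel).as_rotvec()
    angle = np.linalg.norm(rotvec)               # total error in radians
    axis = rotvec / angle if angle > 1e-8 else np.zeros(3)
    return np.degrees(angle), axis
```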

I am sorry if I was confusing. Let me start from scratch. I have recorded around 10 videos and created a dataset with the following repo: https://github.com/F2Wang/ObjectDatasetTools. Each video is somewhat different (different pose of the object and of the camera), but the distance from the object is roughly constant. I am now using 8 videos for training and testing on the other two. In training I get sub-centimetre accuracy with no problem, while in testing I cannot go below 4 centimetres (the position is okay, the orientation is off). I do not think it is a bug, as all the videos were recorded in the same way. It seems to me that the network is really overfitting on those 8 videos.

j96w commented

Yes, it's overfitting. How many frames does each video have? Could you get more frames or videos?

Improved by using data augmentation (rotation and translation of the image). Now the error is around 0.015 on the test set and 0.003 on the training set.
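For later readers: one plausible reading of "rotation and translation of the image" is an in-plane warp applied identically to the RGB, depth and mask crops. This is only a sketch of that idea, not necessarily the exact recipe used here, and the pose label has to be updated consistently (an in-plane rotation about the principal point corresponds to a rotation about the optical axis).

```python
import cv2
import numpy as np

def augment_inplane(rgb, depth, mask, angle_deg, tx, ty, cx, cy):
    # Rotate about the principal point (cx, cy) and shift by (tx, ty) pixels,
    # using nearest-neighbour interpolation for depth and mask.
    h, w = rgb.shape[:2]
    M = cv2.getRotationMatrix2D((cx, cy), angle_deg, 1.0)
    M[:, 2] += (tx, ty)
    rgb_a = cv2.warpAffine(rgb, M, (w, h), flags=cv2.INTER_LINEAR)
    dep_a = cv2.warpAffine(depth, M, (w, h), flags=cv2.INTER_NEAREST)
    msk_a = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    return rgb_a, dep_a, msk_a
```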

Hi, I was going through this thread to understand how to use DenseFusion with captured images. Is it necessary for the camera parameters to be the same for training and testing? Or is it possible to use the released pre-trained YCB weights with images taken from a camera with different parameters? I assumed this should be possible, since the camera params are only used to get a point cloud.

The camera parameters, as you said, are used to get the point cloud that is then fed into the network. If you switch camera parameters between training and testing, the network might not work anymore. Also keep in mind that if you use the YCB weights you will only be able to estimate the pose of objects really similar to the ones in the YCB dataset.

j96w commented

Hi @aditya2592, according to our testing, the released YCB weights still work quite well on another camera with different parameters. As you can see in our real robot grasping video demo, the HSR robot we are using has an Asus Xtion RGB-D sensor, which has different parameters from the Kinect used to build the YCB dataset, and it still works. As you mentioned, since the camera params are only used to generate the point cloud from the depth, I don't see why it wouldn't work as long as your camera is capturing the real depth. Two things are worth noting: (1) make sure your camera scale parameter is correct; (2) make sure the RGB and depth images are matched (for some sensors like the RealSense, the RGB channel and the depth channel need additional steps to maintain pixel-level alignment; keep in mind this would affect the performance when using the released weights).

Thanks for the confirmation @j96w. I think the YCB Video dataset videos were also captured with an Asus Xtion RGB-D sensor. That is what has been mentioned in their paper.

j96w commented

Hi @aditya2592, I'm not sure whether this would make it more clear:

There are two cameras used in YCB:
(1) cx = 312.9869, cy = 241.3109, fx = 1066.778, fy = 1067.487, scale = 1000.0, for the first 48 training videos and all test videos;
(2) cx = 323.7872, cy = 279.6921, fx = 1077.836, fy = 1078.189, scale = 1000.0, for the training videos after 60.

The camera on our HSR is: cx = 321.24099379, cy = 237.11014479, fx = 537.99040688, fy = 539.09122804, scale = 1.0, which I think is not the same focal length.
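For reference, those five numbers are all the dataset loader needs to turn a depth image into the point cloud fed to the network. A minimal sketch of the standard pinhole back-projection, assuming the same (cx, cy, fx, fy, scale) convention as above:

```python
import numpy as np

def depth_to_pointcloud(depth, cx, cy, fx, fy, scale):
    # Back-project an organized depth map into camera-frame 3D points (metres).
    # Swapping cameras means swapping exactly these five numbers; a wrong
    # scale or focal length shows up directly as a wrong translation.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth / scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)   # (H, W, 3)
```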

j96w commented

@aditya2592 @TommasoBendinelli After a few days of thinking, I have changed my opinion on the second part of this issue. Please refer to Issue#150 for more information.

Improved by using data augmentation (rotation and translation of the image). Now the error is around 0.015 on the test set and 0.003 on the training set.

I am currently in the exact same overfitting situation. Would you be so kind as to give us more details about how you did the augmentation? Could you explain more about the "rotation and translation of the image"?