Motivation for the RandomTransformSpace augmentation
fuy34 opened this issue · 4 comments
Hi,
May I ask what the motivation is to perform the random rotation and translation of the tsdf volume during training?
Line 335 in bee3ddb
I understand we may need to pad and crop the ground truth TSDF volume to fit the pre-defined training volume size. However, why should we also rotate and translate it?
I thought it was to compensate for pose errors from BundleFusion. But are the errors really that large, i.e. up to a 360-degree rotation around the z-axis and a 3-meter translation, according to the values in the code?
Line 229 in bee3ddb
Line 259 in bee3ddb
Or is it just so the model adapts to different world coordinate systems? I may be missing something here. Any insight is appreciated.
Thank you in advance!
Good question. This is used for data augmentation, similar to how random resizes, crops, and rotations are used in typical image tasks. Here we want our model to be invariant to 3D rotations around the gravity direction as well as to translations in space (these coordinates are only defined up to an arbitrary choice of world origin).
Note that if the SLAM system giving you poses does not resolve the gravity direction (for example, COLMAP), you may also want to add augmentations for arbitrary 3D rotations. The downside of allowing arbitrary orientations is that it becomes harder for the network to learn priors such as the floor always being flat and horizontal. You can do a similar augmentation for scale, but there you can only support a range of scales that makes sense for the chosen voxel size.
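For concreteness, here is a minimal sketch of that kind of world-frame augmentation, assuming 4x4 camera-to-world pose matrices and a point set for the ground-truth geometry. The function names are illustrative and the ranges mirror the values discussed above; this is not the repo's exact RandomTransformSpace code.

```python
import numpy as np

def random_world_transform(max_translation=3.0):
    """Sample a rigid transform: a full rotation about the gravity (z) axis
    plus a bounded translation (ranges are the ones discussed above)."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0],
                 [s,  c, 0.0],
                 [0.0, 0.0, 1.0]]
    T[:3, 3] = np.random.uniform(-max_translation, max_translation, size=3)
    return T

def augment_scene(T, cam_poses, gt_points):
    """Apply the same T to every camera-to-world pose and to the ground-truth
    geometry, so the relative geometry (and every pixel ray) is unchanged."""
    new_poses = [T @ P for P in cam_poses]
    pts_h = np.concatenate([gt_points, np.ones((len(gt_points), 1))], axis=1)
    new_points = (T @ pts_h.T).T[:, :3]
    return new_poses, new_points
```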
The augmentations here will not help with errors in BundleFusion, since the augmentation is applied to the world coordinate system, not to the pose of each frame. I have thought about also adding a small random rotation/translation to each frame's pose to help the network be robust to SLAM errors as you suggested, but I have not had time to try it yet. Naively adding this augmentation might help a bit, but since the backprojection and accumulation operators are not learnable, the network probably doesn't have the capacity to really take advantage of it. I think there are some interesting research directions here.
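If you wanted to experiment with that idea, a per-frame jitter could look roughly like the sketch below. This is hypothetical and not implemented in the repo; the noise magnitudes and the small-angle rotation construction are just illustrative.

```python
import numpy as np

def jitter_pose(P, rot_std_deg=1.0, trans_std_m=0.01):
    """Hypothetical per-frame pose noise (not in the repo): a small random
    rotation/translation applied to one 4x4 camera-to-world pose P."""
    wx, wy, wz = np.deg2rad(np.random.randn(3) * rot_std_deg)
    # Small-angle approximation: R ~ I + [w]_x (only approximately orthonormal)
    R = np.eye(3) + np.array([[0.0, -wz,  wy],
                              [ wz, 0.0, -wx],
                              [-wy,  wx, 0.0]])
    noise = np.eye(4)
    noise[:3, :3] = R
    noise[:3, 3] = np.random.randn(3) * trans_std_m
    return noise @ P
```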
I think I get the idea. Could I put it this way?
I see that the full process generates more data, because the random transformation plus the cropping yields a sort of random-cropping effect. If we do not apply the transformation, this function always does center cropping, if I understand the code correctly. Meanwhile, the random transformation is applied to the frames and the mesh jointly, so the relative poses among them never change. Since we take the images as input and project their deep features into 3D space, the 3D feature volume undergoes the same transformation. Am I right?
However, unlike the 2D image case, the empty voxels are not filled with 0 from the 3D backbone's point of view. At prediction time, regions that were originally outside the scene receive the same features as the valid ones, due to the feature projection. Will this mislead the network into wrong predictions in those out-of-scene regions (they come into view because part of the scene is transformed outside the training volume)?
Oh, is this the motivation for sparsifying the higher-level predictions
Line 102 in bee3ddb
and for constraining the semantic loss to the valid ground-truth area?
Line 203 in bee3ddb
I think you have the right idea in your first paragraph. We are simply applying a rigid transformation to the entire world coordinate system. All camera poses and ground truth geometry are moved consistently, so the rays through each pixel still pass through the correct ground truth voxels. The only reason this has any effect on anything is that we are discretizing continuous world coordinates (x, y, z) into discrete and finite voxel coordinates (i, j, k). In theory a convolutional network should be invariant to the translations, but in practice, due to edge/boundary effects, it is not, and thus these augmentations help generalization (convolutional networks are not intrinsically rotation invariant either, but we would like them to be in this case).
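You can verify that invariance with a quick self-contained check (not code from the repo): applying the same rigid T to both the camera pose and a world point leaves the point's camera-frame coordinates, and therefore the pixel it projects to, unchanged.

```python
import numpy as np

def to_h(p):                                   # homogeneous coordinates
    return np.append(p, 1.0)

P = np.eye(4); P[:3, 3] = [1.0, 2.0, 0.5]      # camera-to-world pose
X = np.array([0.3, -1.2, 2.0])                 # a ground-truth world point

theta = 0.7; c, s = np.cos(theta), np.sin(theta)
T = np.eye(4)
T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]   # rotation about gravity
T[:3, 3] = [3.0, -2.0, 0.0]                                 # translation

x_cam = np.linalg.inv(P) @ to_h(X)                 # camera coords before augmentation
x_cam_aug = np.linalg.inv(T @ P) @ (T @ to_h(X))   # pose and point moved together
assert np.allclose(x_cam, x_cam_aug)               # same camera coords -> same pixel
```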
I am not sure what you mean in your second paragraph. Everything is transformed consistently, so there are no issues here. It is true that each time we create a finite voxel volume we are cropping the infinite world. During training it is common for large scenes to only partially fit in the volume, but this is not a problem. In fact, training on a volume where the observed geometry is entirely outside the voxel volume still provides a useful training signal to the network (as long as some of the rays still pass through the volume): in this case the network must learn that this is empty space.
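Roughly, the crop amounts to something like the sketch below, assuming a dense ground-truth TSDF grid and an integer voxel offset derived from the (randomly transformed) origin. The function name and the padding value are illustrative; the repo additionally resamples under the rotation, which is omitted here.

```python
import numpy as np

def crop_tsdf(gt_tsdf, offset, out_size):
    """Extract a fixed-size training volume from a larger ground-truth TSDF grid.
    Voxels that fall outside the ground truth are padded with 1 (empty space)."""
    offset = np.asarray(offset)
    out = np.ones(out_size, dtype=gt_tsdf.dtype)            # default: empty
    src_lo = np.maximum(offset, 0)
    src_hi = np.minimum(offset + np.array(out_size), gt_tsdf.shape)
    dst_lo = src_lo - offset
    dst_hi = src_hi - offset
    if np.all(src_hi > src_lo):                              # any overlap at all?
        out[dst_lo[0]:dst_hi[0], dst_lo[1]:dst_hi[1], dst_lo[2]:dst_hi[2]] = \
            gt_tsdf[src_lo[0]:src_hi[0], src_lo[1]:src_hi[1], src_lo[2]:src_hi[2]]
    return out                                               # all ones if no overlap
```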
The motivation for the split_loss is to help multi-scale training. Without it, as the resolution increases, the majority of the voxels are easily classified as empty. This large number of easy examples dominates the loss, hindering the learning of fine details. We also log-transform the TSDF values to help with this, although something like the focal loss from object detection could also be used.
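As a rough sketch of both ideas (the exact formula and masking criterion in the repo may differ; the function names and threshold are illustrative):

```python
import torch

def log_transform(tsdf, shift=1.0):
    """Compress TSDF magnitudes so the many easy far-from-surface voxels do not
    dominate the regression loss; the sign (inside/outside) is preserved."""
    return tsdf.sign() * torch.log1p(tsdf.abs() / shift)

def mask_fine_from_coarse(coarse_tsdf, occ_threshold=0.99):
    """Multi-scale idea: only supervise fine voxels whose coarse parent is near
    the surface (|tsdf| below the truncation value), so easy empty voxels do not
    dominate at high resolution. Returns a boolean mask on the 2x finer grid."""
    near_surface = coarse_tsdf.abs() < occ_threshold         # (D, H, W) boolean
    return (near_surface
            .repeat_interleave(2, dim=-1)
            .repeat_interleave(2, dim=-2)
            .repeat_interleave(2, dim=-3))
```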
The semantic loss only makes sense on the surface (we do not have semantic labels for empty space). An equivalent formulation would be to label the empty space with the semantic ignore_index class. An alternative would be to add an extra semantic class for empty space.
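A minimal sketch of that surface-only semantic loss, assuming voxel-wise logits of shape (N, C, D, H, W), an integer label grid, and a ground-truth TSDF; the |tsdf| threshold is illustrative.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100   # PyTorch's default ignore_index for cross_entropy

def semantic_loss(logits, labels, gt_tsdf):
    """Cross-entropy restricted to the surface: empty-space voxels (|tsdf| ~ 1)
    are given the ignore label so they contribute no gradient."""
    labels = labels.clone()
    labels[gt_tsdf.abs() > 0.99] = IGNORE_INDEX
    return F.cross_entropy(logits, labels, ignore_index=IGNORE_INDEX)
```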
I see. I think I get your idea now.
What I meant is shown in the picture below. Please excuse my ugly drawing.
Say the blue rectangle represents the room after the transformation, and the red triangle is the camera. If a pixel feature is projected along the red ray on the right side, and no other rays intersect it (due to frame selection, for example), then voxels 1-7 will all receive the same feature. However, according to the ground truth, voxels 1 and 2 should predict TSDF values between -1 and 1, while the others should get 1 or larger. This means we are expecting different outputs from the same feature.
But I just realized that, as long as the 3D backbone has a fairly large receptive field, it can infer the value from context instead of limiting itself to the initial voxel feature. And as you said, there should be some pattern even in the far-away empty space that tells the network it is empty.
Thank you so much for your detailed explanation! This work is awesome! Hopefully, I can build something upon it. :p