Lilac-Lee/PointNetLK_Revisited

Huge memory consumption for inference

gitouni opened this issue · 13 comments

Hi, Lilac-Lee, I found that the analytical PointNet-LK consumes much more CUDA memory than the standard PointNet-LK during testing. I have run your original code for ModelNet and 3DMatch. I actually have to restrict the maximum number of resampled points to roughly 5000 to fit into 12 GB of GPU memory (for 3DMatch, num_points per voxel must be lower than 700), or I get an Out of Memory error. Is that normal, or did I do something wrong?

Hi, this is expected when you use a large number of points. The big headache is the analytical feature Jacobian computation. Because the feature dimensions (in_feature size and out_feature size) are big and the number of points is large, you will create a very big matrix (in our case, a big 4D matrix). So, you could restrict the number of points used to compute the Jacobian; in our case, we only used 100 points to compute the Jacobian. Or, use some iterative aggregation, as we proposed, to aggregate the Jacobian. Another solution might be to try manually multiplying and doing the summation. I have seen in some specific cases that the einsum operation takes a lot of memory, since it will automatically broadcast the dimensions of the matrix to match the largest number of dimensions among all operands.
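For illustration, a minimal sketch of the chunked multiply-and-sum idea. The names and shapes (dfeat_dpt, dpt_dxi, the [B, C, N, 3] / [B, N, 3, 6] layout) are assumptions for this example, not the repository's actual variables:

import torch

# Hypothetical per-point Jacobian factors:
#   dfeat_dpt : [B, C, N, 3]  gradient of each feature channel w.r.t. each point
#   dpt_dxi   : [B, N, 3, 6]  gradient of each warped point w.r.t. the 6 twist parameters
B, C, N = 1, 1024, 5000
dfeat_dpt = torch.randn(B, C, N, 3)
dpt_dxi = torch.randn(B, N, 3, 6)

# A single broadcasted product materializes a [B, C, N, 3, 6] tensor before the sum,
# which is what blows up memory for large N:
# J = (dfeat_dpt[..., None] * dpt_dxi[:, None]).sum(dim=(2, 3))   # [B, C, 6]

# Chunking over the point dimension keeps only a [B, C, chunk, 3, 6] slice alive at a time:
def jacobian_chunked(dfeat_dpt, dpt_dxi, chunk=512):
    B, C, _, _ = dfeat_dpt.shape
    J = dfeat_dpt.new_zeros(B, C, 6)
    for s in range(0, dfeat_dpt.shape[2], chunk):
        e = s + chunk
        J += (dfeat_dpt[:, :, s:e, :, None] * dpt_dxi[:, None, s:e]).sum(dim=(2, 3))
    return J

J = jacobian_chunked(dfeat_dpt, dpt_dxi)   # [B, C, 6], same result, much smaller peak memory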

Because it is based on PointNet, a pooling over all point-features produces the point-cloud feature. Thus, before the pooling, the intermediary Jacobian is computed for each point separately.

So I think a fast modification is to mask out everything except the points that participate in computing the point-cloud-feature Jacobian, which is at most C (the feature dimension) points. This strongly reduces the space cost of the intermediary Jacobian.

That is, put max_idx inside the feat_Jac and warp_Jac computation, as an alternative to masking J at the end.
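A minimal sketch of this observation, with made-up shapes (B, C, N and per_point_feat are illustrative, not the repository's variables):

import torch

# per_point_feat stands for the point-wise features right before max-pooling
B, C, N = 1, 1024, 20000
per_point_feat = torch.randn(B, C, N)

# Max-pooling keeps exactly one winning point per feature channel...
max_idx = per_point_feat.max(dim=-1).indices        # [B, C]

# ...so at most C distinct points can influence the pooled feature and its Jacobian.
selected = torch.unique(max_idx.reshape(-1))        # <= C point indices
print(selected.numel(), "of", N, "points actually contribute")

# The intermediate per-point Jacobian therefore only needs to be evaluated on
# points[..., selected] (or with max_idx pushed into feat_Jac / warp_Jac directly),
# shrinking it from N points to at most C points.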

Selecting the points to be computed before the whole algorithm starts is a good idea; however, the max_idx from max-pooling will vary with the transformation G, because the input has changed. As you can see, rotation augmentation affects classification accuracy, so it is likely to affect max_idx in the IC-LK module as well.

Thank you very much for your reply. I found that random point selection (100 points) is implemented for the train and val modes but not for test. It is probable that involving only a few points in the analytical Jacobian computation can meet the accuracy requirements, but the way the indices are selected will affect the result. When the point correspondences are unknown, a selected point may be an outlier, and this usually happens for point-cloud pairs with a low overlap ratio. Therefore, involving all points in the calculation (at once, or via iterative aggregation or voxelization) is still the safest way for large-scale point-cloud registration. These are only personal opinions.

max_idx is computed inside the algorithm (the Jacobian is computed once at the beginning; see model.py). The pooling will of course give a different result for different inputs.

Using max_idx does not change the functionality, so this modification does not affect the accuracy.
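For intuition, a toy, translation-only sketch of the compute-the-Jacobian-once structure being described; a random linear projection with max-pooling stands in for ptnet, so none of the names below are the repository's:

import torch

def feat(points):                       # stand-in "PointNet": random projection + max-pool
    torch.manual_seed(0)
    W = torch.randn(8, 3)
    return (W @ points.T).max(dim=-1).values        # [8] pooled feature

template = torch.randn(100, 3)
source = template + 0.01                # template translated by (0.01, 0.01, 0.01)

# Jacobian of the TEMPLATE feature w.r.t. a translation, via finite differences,
# computed once before the loop (this is also where max_idx would be fixed):
eps = 1e-3
J = torch.stack([(feat(template + eps * torch.eye(3)[i]) - feat(template)) / eps
                 for i in range(3)], dim=-1)        # [8, 3]
pinv_J = torch.linalg.pinv(J)

t = torch.zeros(3)                      # running translation estimate
for _ in range(10):
    r = feat(source + t) - feat(template)           # feature residual
    t = t - pinv_J @ r                              # every update reuses the SAME J
print(t)                                # approaches (-0.01, -0.01, -0.01)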

You're right, I forgot the tip mentioned in PointNet-LK: the Jacobian matrix only needs to be computed once. I will give it a try.

I modified the PointNet_features class in model.py, but it raises a new error. The forward function is modified as follows:

x = points.transpose(1, 2)  # [B, 3, N]
if iter == -1:
    # mlp1: conv -> batchnorm -> ReLU, keeping the intermediates needed for the analytical Jacobian
    x = self.mlp1[0](x)
    A1_x = x
    x = self.mlp1[1](x)
    bn1_x = x
    x = self.mlp1[2](x)
    M1 = (x > 0).type(torch.float)

    # mlp2
    x = self.mlp2[0](x)
    A2_x = x
    x = self.mlp2[1](x)
    bn2_x = x
    x = self.mlp2[2](x)
    M2 = (x > 0).type(torch.float)

    # mlp3
    x = self.mlp3[0](x)
    A3_x = x
    x = self.mlp3[1](x)
    bn3_x = x
    x = self.mlp3[2](x)
    M3 = (x > 0).type(torch.float)

    # indices of the points that win the max-pooling, one per feature channel
    max_idx = torch.max(x, -1)[-1]
    x = torch.nn.functional.max_pool1d(x, x.size(-1))
    x = x.view(x.size(0), -1)

    # extract weights....
    A1 = self.mlp1[0].weight
    A2 = self.mlp2[0].weight
    A3 = self.mlp3[0].weight

    # keep only the winning points in every intermediate tensor
    max_id = max_idx.reshape(-1)
    M1, M2, M3 = M1[..., max_id], M2[..., max_id], M3[..., max_id]
    A1_x, A2_x, A3_x = A1_x[..., max_id], A2_x[..., max_id], A3_x[..., max_id]
    bn1_x, bn2_x, bn3_x = bn1_x[..., max_id], bn2_x[..., max_id], bn3_x[..., max_id]
    return x, [M1, M2, M3], [A1, A2, A3], [A1_x, A2_x, A3_x], [bn1_x, bn2_x, bn3_x], max_idx

However, it raises the following error:
dBN1 = torch.autograd.grad(outputs=BN1, inputs=Ax1, grad_outputs=torch.ones(BN1.size()).to(device), retain_graph=True)[0].unsqueeze(1).detach()
File "/home/bit/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 223, in grad
return Variable._execution_engine.run_backward(
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Setting allow_unused=True just makes it return None.

Hi, gitouni.

I think the problem is that you forgot to detach the tensor before masking with max_id.
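A tiny stand-alone reproducer of that failure mode, plus a fix in the same spirit (detach, mask, then rebuild the op from the masked slice); this is illustrative and not the repository's code:

import torch

x = torch.randn(4, requires_grad=True)
y = x * 2.0
idx = torch.tensor([1, 3])

x_masked = x[idx]          # a NEW graph node; y was never computed from it
y_masked = y[idx]

# This reproduces the traceback above: y_masked does not depend on x_masked in the graph.
try:
    torch.autograd.grad(y_masked, x_masked, grad_outputs=torch.ones_like(y_masked))
except RuntimeError as e:
    print(e)               # "One of the differentiated Tensors appears to not have been used..."

# Detach, mask, then rebuild the op from the masked slice, so the dependency exists:
x_masked = x.detach()[idx].requires_grad_(True)
y_masked = x_masked * 2.0
g = torch.autograd.grad(y_masked, x_masked, grad_outputs=torch.ones_like(y_masked))[0]
print(g)                   # tensor([2., 2.])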

I have fixed this problem and verified that the idea is correct. Thank you very much. I found that ptnet() must be called twice for the analytical Jacobian: first for max_idx extraction, and a second time for the Jacobian computation with respect to p[..., max_idx]. Because max_idx is calculated from [M1, M2, M3], [A1, A2, A3], [A1_x, A2_x, A3_x], computing them together would create a cyclic-graph problem.
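A self-contained sketch of that two-pass idea; extract() below is only a stand-in for ptnet(), and all shapes and names are assumptions:

import torch

conv = torch.nn.Conv1d(3, 64, 1)        # stand-in for the first MLP layer

def extract(points):                    # points: [B, N, 3]
    x = conv(points.transpose(1, 2))    # per-point features [B, 64, N]
    pooled, max_idx = x.max(dim=-1)     # max-pooling and the winning point per channel
    return pooled, max_idx              # max_idx: [B, 64]

p = torch.randn(2, 10000, 3)

# Pass 1: only find which points survive the max-pooling (no graph needed).
with torch.no_grad():
    _, max_idx = extract(p)
keep = torch.unique(max_idx.reshape(-1))            # at most 64 distinct point indices

# Pass 2: rebuild the forward pass from the reduced point set, so every intermediate
# used for the analytical Jacobian descends from p_small and autograd.grad works.
p_small = p[:, keep, :].detach().requires_grad_(True)
pooled, _ = extract(p_small)                        # same pooled values: all winners were kept
grad = torch.autograd.grad(pooled.sum(), p_small)[0]
print(keep.numel(), "points used instead of", p.shape[1])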

Hi @Jarrome, another interesting question for you. As the authors of PointNet-LK mentioned, the avg symmetric function shows better performance than the max symmetric function. Is there a similar approach to reduce memory in the analytical Jacobian computation with avg pooling?

There is no way to losslessly reduce the computation and space if you use mean-pooling: all points contribute to the gradient.

However, you can still try random sampling or other approaches, as commented by @Lilac-Lee, since sampling affects mean-pooling less.
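A small illustration of why subsampling is a reasonable approximation under mean-pooling (made-up shapes, not the repository's code):

import torch

B, C, N, k = 1, 64, 20000, 1000

# Pretend per-point contributions to the feature Jacobian, one [C, 6] block per point:
per_point_J = torch.randn(B, N, C, 6)

# With mean-pooling, the pooled-feature Jacobian is the average of ALL per-point terms,
# so no point can be dropped without introducing some error...
J_full = per_point_J.mean(dim=1)                    # [B, C, 6]

# ...but a uniform random subset gives an unbiased estimate of that average,
# which is why subsampling tends to degrade mean-pooling gracefully.
idx = torch.randperm(N)[:k]
J_sampled = per_point_J[:, idx].mean(dim=1)         # [B, C, 6]

print((J_full - J_sampled).abs().mean())            # small for reasonably large k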

Thanks for your insight; that problem had confused me for a long time.