YvanYin/Metric3D

How to use a custom camera model and camera parameters with ONNX inference

alpereninci opened this issue · 5 comments

I have a custom camera with intrinsic calibration, and I am having trouble with ONNX inference. Should I pass a "cam_model" parameter to the model, or is post-processing enough?

I think the post-processing in test_onnx.py is not complete.

I would like to run inference with the "metric3d_vit_small" model.

I see the following in "do_test.py":

ori_focal = (intrinsic[0] + intrinsic[1]) / 2
canonical_focal = canonical_space['focal_length']
cano_label_scale_ratio = canonical_focal / ori_focal
..
rgb, _, pad, resize_label_scale_ratio = resize_for_input(rgb, forward_size, canonical_intrinsic, [ori_h, ori_w], 1.0)
label_scale_factor = cano_label_scale_ratio * resize_label_scale_ratio
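
For what it's worth, only the focal length enters the depth scale here; cx and cy do not. A minimal sketch of that de-canonical step as ONNX post-processing (my own, assuming the model output is in the canonical camera space, fx and fy come from your calibration, and resize_ratio is the keep-ratio resize factor applied before inference):

# Hedged sketch: invert the canonical-camera transform after running the ONNX model.
# fx, fy: calibrated focal lengths in pixels; resize_ratio: factor by which the
# image was resized (keep-ratio) before being fed to the network.
def decanonical_depth(pred_depth, fx, fy, resize_ratio, canonical_focal=1000.0):
    effective_focal = (fx + fy) / 2.0 * resize_ratio  # focal length of the resized image
    return pred_depth * effective_focal / canonical_focal

This is the same relation that the export wrapper further below applies inside the graph, just computed outside of ONNX.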

and in the vit.raft5.small.py config file:

max_value = 200
# configs of the canonical space
data_basic = dict(
    canonical_space = dict(
        # img_size=(540, 960),
        focal_length=1000.0,
    ),
    depth_range=(0, 1),
    depth_normalize=(0.1, max_value),
    crop_size=(616, 1064),  # % 28 == 0
    clip_depth_range=(0.1, 200),
    vit_size=(616, 1064),
)
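
As an aside, the comment next to crop_size means the input dimensions should be divisible by 28, which (616, 1064) satisfies. A rough sketch of the keep-ratio resize plus padding that produces such an input, loosely modelled on resize_for_input (function name and details are illustrative, not the repo's exact code; the padding colour reuses the RGB mean that also appears in the export wrapper below):

import cv2
import numpy as np

def resize_and_pad(rgb, fx, fy, cx, cy, input_size=(616, 1064)):
    # Keep-ratio resize towards the fixed network input, then pad the remainder.
    h, w = rgb.shape[:2]
    scale = min(input_size[0] / h, input_size[1] / w)
    rgb = cv2.resize(rgb, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_LINEAR)
    # The resized image behaves like a camera with scaled intrinsics.
    fx, fy, cx, cy = fx * scale, fy * scale, cx * scale, cy * scale
    pad_h = input_size[0] - rgb.shape[0]
    pad_w = input_size[1] - rgb.shape[1]
    pad = [pad_h // 2, pad_h - pad_h // 2, pad_w // 2, pad_w - pad_w // 2]  # top, bottom, left, right
    rgb = cv2.copyMakeBorder(rgb, pad[0], pad[1], pad[2], pad[3],
                             cv2.BORDER_CONSTANT, value=[123.675, 116.28, 103.53])
    # Padding shifts the principal point but leaves the focal length untouched.
    return rgb, (fx, fy, cx + pad[2], cy + pad[0]), pad, scale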

During ONNX inference, should I use canonical_space['focal_length'] = 1000 and normalize scale = 1 (both from the config)?
How should I use cx and cy? Are they important parameters?

Also, what should I do if I change the input resolution? The given input resolution is (H, W) = (616, 1064). What would happen, and what should I do, if I downsample the image resolution to (308, 532)?

@Owen-Liuyuxuan does the ONNX model support a custom camera?

@alpereninci cc: @YvanYin
For now, the ONNX scripts in this repo and the provided ONNX model do not directly support a custom camera, so we may have to compute the scale outside the ONNX computation.

In ros2_vision_inference, I demonstrate how to pass the camera matrix $P$ as an additional input to the ONNX models and get correctly scaled depth for any perspective camera.

I trimmed down the projection and coordinate transform code from ros2_vision_inference to showcase the changes we need:

## Change the model export script
class Metric3DExportModel(torch.nn.Module):
    """Wrap the Metric3D meta_arch so the exported graph takes the camera
    matrix P as a second input and returns metric depth directly."""
    def __init__(self, meta_arch, is_export_rgb=True):
        super().__init__()
        self.meta_arch = meta_arch
        self.register_buffer('rgb_mean', torch.tensor([123.675, 116.28, 103.53]).view(1, 3, 1, 1).cuda())
        self.register_buffer('rgb_std', torch.tensor([58.395, 57.12, 57.375]).view(1, 3, 1, 1).cuda())
        self.input_size = (616, 1064)

    def normalize_image(self, image):
        image = image - self.rgb_mean
        image = image / self.rgb_std
        return image

    def forward(self, image, P):
        image = self.normalize_image(image)
        with torch.no_grad():
            pred_depth, confidence, output_dict = self.meta_arch.inference({'input': image})
            # de-canonical transform: real focal is (fx + fy) / 2, canonical focal is 1000.0
            canonical_to_real_scale = (P[:, 0, 0, None, None] + P[:, 1, 1, None, None]) / 2.0 / 1000.0
            pred_depth = pred_depth * canonical_to_real_scale  # now the depth is metric
        return pred_depth
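
For completeness, a hedged sketch of what the two-input export call could look like (the input names have to match the ones used when running the ONNX session below; the file name and opset are illustrative):

import torch

export_model = Metric3DExportModel(meta_arch).cuda().eval()
dummy_image = torch.zeros(1, 3, 616, 1064, device='cuda')
dummy_P = torch.zeros(1, 3, 4, device='cuda')
torch.onnx.export(
    export_model,
    (dummy_image, dummy_P),
    "metric3d_vit_small.onnx",
    input_names=["image", "P"],
    output_names=["pred_depth"],
    opset_version=17,
)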

## In testing
# A zero-filled P only checks that the graph runs end to end; for metric depth,
# fill P[:, 0, 0] and P[:, 1, 1] with the (resized) fx and fy.
dummy_P = np.zeros([1, 3, 4], dtype=np.float32)
outputs = ort_session.run(None, {"image": dummy_image, "P": dummy_P})

Any resize or reshape operation applied before reaching the (616, 1064) input should be accompanied by a matching change to the camera matrix $P$.
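
Concretely, if you resize the original image by a factor s on the way to (616, 1064) and then pad it, the first two rows of $P$ scale by s and the principal point shifts by the padding. A hedged sketch (my own helper, not from ros2_vision_inference) for building the matrix the model expects:

import numpy as np

def build_P(fx, fy, cx, cy, s, pad_left=0, pad_top=0):
    # Camera matrix for the image actually fed to the network, shaped (1, 3, 4).
    P = np.zeros((1, 3, 4), dtype=np.float32)
    P[0, 0, 0] = fx * s
    P[0, 1, 1] = fy * s
    P[0, 0, 2] = cx * s + pad_left
    P[0, 1, 2] = cy * s + pad_top
    P[0, 2, 2] = 1.0
    return P

Only P[:, 0, 0] and P[:, 1, 1] affect the depth scale in the export wrapper above; cx and cy are filled in only for completeness.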

But if you are considering changing the input size of the network: I have not succeeded in doing that myself, and I am afraid there could be errors inside the ViT network. @YvanYin any ideas on changing the network input size?

Thanks for your reply @Owen-Liuyuxuan. Actually, I am considering changing the input size of the network.

By slightly changing the input shape of the network, the ONNX model runs (at least with no obvious errors). However, I am not sure about the generalization ability or the metric accuracy. I will give it a test.

I have tried it on personal data. It works, but the canonical camera focal length is not necessarily 500. I believe you could try to fine-tune this parameter for your use case.

For my scene, it is about 1000/sqrt(2), though I don't know why.

BTW, for TensorRT usage, we need to clear the cache every time before changes to constant parameters take effect, so I suggest doing the tuning on GPU first.