How to use a custom camera model and camera parameters with ONNX inference
alpereninci opened this issue · 5 comments
I have a custom camera with an intrinsic calibration, and I am having trouble with ONNX inference. Should I pass the "cam_model" parameter to the model, or is post-processing enough?
I think the post-processing in test_onnx.py is not complete.
I would like to run inference with the "metric3d_vit_small" model.
I see in "do_test.py"
ori_focal = (intrinsic[0] + intrinsic[1]) / 2
canonical_focal = canonical_space['focal_length']
cano_label_scale_ratio = canonical_focal / ori_focal
..
rgb, _, pad, resize_label_scale_ratio = resize_for_input(rgb, forward_size, canonical_intrinsic, [ori_h, ori_w], 1.0)
label_scale_factor = cano_label_scale_ratio * resize_label_scale_ratio
and in the vit.raft5.small.py config file:
```python
max_value = 200
# configs of the canonical space
data_basic = dict(
    canonical_space=dict(
        # img_size=(540, 960),
        focal_length=1000.0,
    ),
    depth_range=(0, 1),
    depth_normalize=(0.1, max_value),
    crop_size=(616, 1064),  # %28 = 0
    clip_depth_range=(0.1, 200),
    vit_size=(616, 1064),
)
```
During ONNX inference, should I use canonical_space['focal_length'] = 1000 (from the config) and a normalize scale of 1 (also from the config)?
How should I use cx and cy? Are they important parameters?
Also, what should I do if I change the input resolution? The expected input resolution is (H, W) = (616, 1064). What would happen, and what should I do, if I downsample the image to (308, 532)?
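For reference, this is how I currently compute the scale outside the model, following the logic above (a rough sketch with made-up intrinsics and file name; the resize/padding handling is my own assumption):
```python
import cv2

# Example intrinsics of my custom camera (placeholder values)
fx, fy, cx, cy = 720.0, 720.0, 640.0, 360.0
rgb_origin = cv2.imread("frame.png")[:, :, ::-1]  # BGR -> RGB
h, w = rgb_origin.shape[:2]

input_size = (616, 1064)      # crop_size / vit_size from the config
canonical_focal = 1000.0      # canonical_space['focal_length']

# Keep-ratio resize to the network input, scaling the intrinsics with the image
scale = min(input_size[0] / h, input_size[1] / w)
rgb = cv2.resize(rgb_origin, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_LINEAR)
fx_s, fy_s, cx_s, cy_s = fx * scale, fy * scale, cx * scale, cy * scale
# (padding to exactly (616, 1064) omitted; cx_s, cy_s would go into the camera matrix)

# De-canonical scale applied to the network output to recover metric depth
canonical_to_real_scale = (fx_s + fy_s) / 2 / canonical_focal
# pred_metric_depth = pred_canonical_depth * canonical_to_real_scale
```
Is this the right way to combine cano_label_scale_ratio and resize_label_scale_ratio?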
@Owen-Liuyuxuan does the ONNX model support a custom camera?
@alpereninci cc: @YvanYin
For now, the ONNX scripts in this repo and the provided ONNX model do not directly support a custom camera, so we may have to compute the scale outside the ONNX computation.
In ros2_vision_inference I demonstrate how to feed in the camera matrix. Below I have trimmed down the projection and coordinate-transform code from ros2_vision_inference to showcase the changes we need:
## Change the model export script
```python
import torch


class Metric3DExportModel(torch.nn.Module):
    def __init__(self, meta_arch, is_export_rgb=True):
        super().__init__()
        self.meta_arch = meta_arch
        self.register_buffer('rgb_mean', torch.tensor([123.675, 116.28, 103.53]).view(1, 3, 1, 1).cuda())
        self.register_buffer('rgb_std', torch.tensor([58.395, 57.12, 57.375]).view(1, 3, 1, 1).cuda())
        self.input_size = (616, 1064)

    def normalize_image(self, image):
        image = image - self.rgb_mean
        image = image / self.rgb_std
        return image

    def forward(self, image, P):
        original_image = image.clone()
        image = self.normalize_image(image)
        with torch.no_grad():
            pred_depth, confidence, output_dict = self.meta_arch.inference({'input': image})
        # 1000.0 is the focal length of the canonical camera
        canonical_to_real_scale = (P[:, 0, 0, None, None] + P[:, 1, 1, None, None]) / 2 / 1000.0
        pred_depth = pred_depth * canonical_to_real_scale  # now the depth is metric
        return pred_depth
```
## In testing
```python
dummy_P = np.zeros([1, 3, 4], dtype=np.float32)
outputs = ort_session.run(None, {"image": dummy_image, "P": dummy_P})
```
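At inference time, P should carry the real intrinsics of the (616, 1064) input rather than zeros. A minimal sketch (the intrinsic values are placeholders, and `image_tensor` is assumed to be the preprocessed frame):
```python
import numpy as np

# fx, fy, cx, cy of the image *after* it has been resized/padded to (616, 1064)
fx, fy, cx, cy = 620.0, 620.0, 532.0, 308.0
P = np.array([[[fx, 0.0, cx, 0.0],
               [0.0, fy, cy, 0.0],
               [0.0, 0.0, 1.0, 0.0]]], dtype=np.float32)

# image_tensor: float32 RGB array of shape (1, 3, 616, 1064), values in [0, 255]
outputs = ort_session.run(None, {"image": image_tensor, "P": P})
pred_depth = outputs[0]  # already metric, scaled inside the exported model
```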
Any resize operations applied before reaching the (616, 1064) input should be accompanied by corresponding changes to the camera matrix, as in the sketch below.
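For example, if the raw frame is rescaled by factors (sx, sy) before padding, the rows of P have to be rescaled with the same factors (a minimal sketch, not code from the repo; `P_raw` is assumed to hold the original intrinsics):
```python
import numpy as np

def rescale_P(P, sx, sy):
    """Rescale a (1, 3, 4) camera matrix after resizing the image by
    sx horizontally and sy vertically."""
    P = P.copy()
    P[:, 0, :] *= sx  # fx, skew, cx, tx
    P[:, 1, :] *= sy  # fy, cy, ty
    return P

# e.g. downsampling a (1232, 2128) frame to the (616, 1064) network input
P_net = rescale_P(P_raw, sx=1064 / 2128, sy=616 / 1232)
```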
But if you are considering changing the input size of the network itself, I have not succeeded in doing so myself; I am afraid there could be errors inside the ViT network. @YvanYin any ideas on changing the input size of the network?
Thanks for your reply @Owen-Liuyuxuan. Actually, I am considering changing the input size of the network.
By slightly changing the input shape of the network, the ONNX model works (at least with no weird errors). However, I am not sure about the generalization ability and the metric accuracy. I will give it a test.
I have tried it on my own data. It works, but the canonical camera focal length is not necessarily 500. I believe you could try fine-tuning that parameter for your usage.
For my scene it is about 1000/sqrt(2), though I don't know why.
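If you want to calibrate that constant yourself, one simple option (my suggestion, not something from the repo) is to treat the canonical focal length as a free parameter and fit it against a few reference depths:
```python
import numpy as np

def fit_canonical_focal(pred_canonical, gt_depth, mask, fx, fy):
    """Estimate the canonical focal length f_c in
    d_metric = d_canonical * ((fx + fy) / 2) / f_c
    from sparse ground-truth depths (mask selects valid pixels)."""
    f_real = (fx + fy) / 2.0
    ratio = gt_depth[mask] / pred_canonical[mask]  # each pixel gives f_real / f_c
    return f_real / np.median(ratio)
```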
BTW, for TensorRT usage, we need to clear the cache every time before changes to constant parameters take effect, so I suggest doing the tuning on GPU first.