How to improve the metric depth value?
MoAbbasid opened this issue · 9 comments
Hi, I need to get the distance to an object. I gathered a small dataset of outdoor images taken at varied distances to test, and the model's results vary widely.
My questions are:
What is the best practice to improve the results? I have already calibrated the camera and have the intrinsics; what else can I do?
The model output is clipped to stay under a certain value to account for the sky, correct?
My images are w=3024, h=4032. The code I use to generate the depth and visualization is below.
The red dot is 4 m from the camera, but I got 11.45 from the vit-small model and 20.764 from the vit-large model, which is obviously way off.
Another test I ran at 2 m produced 1.3 for vit_small and 1.5 for vit_large, which is still not ideal but workable.
import cv2
import torch
rgb_file = '/content/MG_5u_4m.jpg'
input_size = (616, 1064)  # (height, width) of the model input
intrinsic = [3000, 3000, 1529.95662, 1976.17563]  # camera intrinsics: [fx, fy, cx, cy]
padding_values = [123.675, 116.28, 103.53]
# Load and preprocess image
rgb_origin = cv2.imread(rgb_file)[:, :, ::-1]
# Adjust input size to fit the model
h, w = rgb_origin.shape[:2]
scale = min(input_size[0] / h, input_size[1] / w)
rgb = cv2.resize(rgb_origin, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_LINEAR)
# Scale intrinsic parameters
intrinsic = [intrinsic[0] * scale, intrinsic[1] * scale, intrinsic[2] * scale, intrinsic[3] * scale]
# Padding
h, w = rgb.shape[:2]
pad_h = input_size[0] - h
pad_w = input_size[1] - w
pad_h_half = pad_h // 2
pad_w_half = pad_w // 2
rgb = cv2.copyMakeBorder(rgb, pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half, cv2.BORDER_CONSTANT, value=padding_values)
pad_info = [pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half]
# Normalize
mean = torch.tensor([123.675, 116.28, 103.53]).float()[:, None, None]
std = torch.tensor([58.395, 57.12, 57.375]).float()[:, None, None]
rgb = torch.from_numpy(rgb.transpose((2, 0, 1))).float()
rgb = torch.div((rgb - mean), std)
rgb = rgb[None, :, :, :].cuda()
# Load model
model = torch.hub.load('yvanyin/metric3d', 'metric3d_vit_small', pretrain=True)
model.cuda().eval()
# Perform inference
with torch.no_grad():
    pred_depth, confidence, output_dict = model.inference({'input': rgb})
# Remove padding
pred_depth = pred_depth.squeeze()
pred_depth = pred_depth[pad_info[0] : pred_depth.shape[0] - pad_info[1], pad_info[2] : pred_depth.shape[1] - pad_info[3]]
# upsample to original size
pred_depth = torch.nn.functional.interpolate(pred_depth[None, None, :, :], rgb_origin.shape[:2], mode='bilinear').squeeze()
###################### canonical camera space ######################
#### de-canonical transform
canonical_to_real_scale = intrinsic[0] / 1000.0 # 1000.0 is the focal length of canonical camera
pred_depth = pred_depth * canonical_to_real_scale # now the depth is metric
pred_depth = torch.clamp(pred_depth, 0, 300)
Any advice on getting this closer to real-world scale is appreciated.
@MoAbbasid Hello, if you can only adjust the post-processing without changing the model, what I have been trying is adjusting fx and fy in the intrinsic matrix.
Hi @oywenjun11, I obtained these values by calibrating with OpenCV using the chessboard pattern. Do you suggest just randomly changing these values, and keeping cx and cy the same?
@MoAbbasid Hello, my suggestion is to try modifying fx and fy. They do not necessarily have to be the values from your OpenCV calibration, because this part of the post-processing simply scales the predicted depth map.
@MoAbbasid Hello, I have also used OpenCV's chessboard calibration to calibrate the intrinsic parameters, but I found that its results fluctuate a lot (probably due to the strict quality requirements for the calibration images), making it difficult to calibrate accurately. It even performs worse than using default parameters.
Hello @oywenjun11 @sukasu403, I tried adjusting the focal lengths fx and fy, but that doesn't work because the required value is not consistent: it differs from image to image.
In the image where the distance is 4 m, I got the desired result at f=500,
but in the second picture, where the ground truth is actually 2 m, I got the desired result at f=250.
So it is not consistent across pictures.
What else can I try to get uniform metric values?
Could it be that the sky values are causing the depth result to be large?
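The inconsistency can be made concrete. Since the de-canonical step in the posted code multiplies the prediction by fx / 1000, changing fx rescales every depth value by the same factor, so the focal length that would reproduce a known distance follows from a single measurement. A minimal sketch, using the vit_small readings from the original post; the helper name `focal_to_match` is hypothetical and not part of Metric3D:

```python
def focal_to_match(fx_used, predicted_m, ground_truth_m):
    """Focal length that would make the rescaled prediction equal the ground truth.

    Assumes depth scales linearly with fx, as in the de-canonical transform
    (depth = canonical_depth * fx / 1000).
    """
    return fx_used * ground_truth_m / predicted_m


fx_calibrated = 3000.0  # fx from the original post

# vit_small readings from the original post: 11.45 at 4 m, 1.3 at 2 m
fx_for_4m = focal_to_match(fx_calibrated, predicted_m=11.45, ground_truth_m=4.0)
fx_for_2m = focal_to_match(fx_calibrated, predicted_m=1.3, ground_truth_m=2.0)

print(round(fx_for_4m, 1), round(fx_for_2m, 1))
```

One image calls for a focal length well below the calibrated value and the other for one well above it, which suggests that no single global fx correction can fix both.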
Most training images have a larger width than height, so you could try adjusting the height/width.
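For the portrait images in the original post, the preprocessing arithmetic shows how much of the model input the picture actually occupies. This is only a worked calculation based on the numbers in the posted code (input size 616 x 1064, image 4032 x 3024, fx = 3000); the variable names are illustrative:

```python
input_h, input_w = 616, 1064   # model input size (height, width) from the posted code
img_h, img_w = 4032, 3024      # portrait image from the original post
fx = 3000.0                    # calibrated focal length in pixels

# Same resize rule as the posted code: fit the image inside the input size.
scale = min(input_h / img_h, input_w / img_w)

# Focal length after resizing, and the resulting de-canonical scale factor.
fx_resized = fx * scale
canonical_to_real_scale = fx_resized / 1000.0

# Fraction of the input width covered by image content; the rest is padding.
used_width_fraction = (img_w * scale) / input_w

print(f"scale={scale:.4f}, fx_resized={fx_resized:.1f}, "
      f"used_width_fraction={used_width_fraction:.3f}")
```

The height limits the resize, so the image fills less than half of the input width and the remainder is constant padding.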
@YvanYin,
I swapped the height and width.
In the 2 m GT image I got 2 m at intrinsic = [500, 500, 1512, 2016].
In the 4 m GT image I got 4 m at intrinsic = [750, 750, 1512, 2016].
Results varied for other distances and images as well. I'm using vit_large; any other suggestions?
How big a role do the sky values play?
Oh, we normally do not transpose the height and width. For the sky values, we suggest just using the confidence map to filter them out.
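One way to apply the confidence map when reading the depth at a single point (such as the red dot) is to take the median over a small window and skip low-confidence pixels. The sketch below is an assumption about how this could be done, not the maintainers' code: the function name, the window size, and the threshold are hypothetical, and the thread does not state the value range of the confidence map, so the threshold has to be chosen by inspecting it. It uses NumPy arrays; torch tensors such as `pred_depth` and `confidence` would first need `.cpu().numpy()` and must have the same height and width.

```python
import numpy as np


def masked_depth_at(depth, confidence, row, col, conf_threshold, half_window=2):
    """Median depth in a window around (row, col), ignoring low-confidence pixels.

    Returns NaN if no pixel in the window passes the threshold.
    """
    r0, r1 = max(row - half_window, 0), min(row + half_window + 1, depth.shape[0])
    c0, c1 = max(col - half_window, 0), min(col + half_window + 1, depth.shape[1])
    patch = depth[r0:r1, c0:c1]
    keep = confidence[r0:r1, c0:c1] >= conf_threshold
    if not keep.any():
        return float("nan")
    return float(np.median(patch[keep]))


# Synthetic example: the top row stands in for sky (far depth, low confidence).
depth = np.full((5, 5), 4.0)
depth[0, :] = 300.0
confidence = np.ones((5, 5))
confidence[0, :] = 0.0

print(masked_depth_at(depth, confidence, row=1, col=1, conf_threshold=0.5, half_window=1))
```

The median keeps a few stray pixels from dominating the reading, and the NaN return makes it explicit when a point has no trustworthy depth at all.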
For the inconsistency, what about center-cropping the first image and resizing it back to the original size? I think this inconsistency is largely caused by insufficient training data with large image sizes and small focal lengths.
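If the crop-and-resize suggestion is tried, the intrinsics passed to the rest of the pipeline would presumably need to follow the image: cropping shifts the principal point, and resizing back to the original size multiplies the focal length. The thread does not say whether the intrinsics should be updated, so the following is a sketch under that assumption, using standard pinhole-camera bookkeeping. Both helper names and the `keep_fraction` of 0.5 are hypothetical.

```python
def center_crop_box(width, height, keep_fraction):
    """Return (x0, y0, crop_w, crop_h) for a centered crop keeping the given fraction."""
    crop_w = int(round(width * keep_fraction))
    crop_h = int(round(height * keep_fraction))
    x0 = (width - crop_w) // 2
    y0 = (height - crop_h) // 2
    return x0, y0, crop_w, crop_h


def intrinsics_after_crop_resize(intrinsic, box, out_width, out_height):
    """Update [fx, fy, cx, cy] for a crop followed by a resize to (out_width, out_height)."""
    fx, fy, cx, cy = intrinsic
    x0, y0, crop_w, crop_h = box
    sx = out_width / crop_w   # horizontal resize factor
    sy = out_height / crop_h  # vertical resize factor
    # The crop shifts the principal point; the resize scales all four values.
    return [fx * sx, fy * sy, (cx - x0) * sx, (cy - y0) * sy]


# Image size and intrinsics from the original post.
width, height = 3024, 4032
intrinsic = [3000, 3000, 1529.95662, 1976.17563]

box = center_crop_box(width, height, keep_fraction=0.5)
new_intrinsic = intrinsics_after_crop_resize(intrinsic, box, width, height)
print(box, new_intrinsic)
```

The image itself can then be cropped with `rgb_origin[y0:y0 + crop_h, x0:x0 + crop_w]` and resized with `cv2.resize` before running the same preprocessing. Keeping half of each dimension doubles the effective focal length in pixels, which moves the image away from the small-focal-length regime mentioned above.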