HKUST-Aerial-Robotics/Flow-Motion-Depth

Is depth map on absolute scale? Can the data be used to warp one frame to the view of next frame?

NagabhushanSN95 opened this issue · 15 comments

Hi,

Thank you for sharing the database.
Are the depth values on the absolute scale? If yes, are they in meters or something different?

I'm planning to use your dataset for a view-synthesis-related problem I'm working on. Can you please confirm if your database will be suitable for that? To elaborate, I need to warp a frame to the view of the next frame using the relative transformation matrix, depth map, and camera intrinsic matrix.

I tried warping between the first two frames of the sample dataset you've provided, using depth * 10 based on this issue. But the warped frame does not match the next frame.

I tried different permutations of the translation and rotation values, and I also tried different depth multipliers. None of them worked. Can you also please specify the coordinate convention in which the camera pose is defined?

The transformation matrices for the first two frames are

[[ 8.33396038e-01  6.57202359e-02 -5.48754858e-01  8.58303600e+00], 
[ 5.51991914e-01 -1.48374374e-01  8.20542486e-01  4.93395081e+02], 
[-2.74949125e-02 -9.86745102e-01 -1.59931654e-01  1.76480530e+02], 
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00]]

[[ 7.77160474e-01  6.57259766e-02 -6.25860762e-01  1.54254060e+01], 
[ 6.29285969e-01 -8.83683018e-02  7.72133545e-01  4.94666809e+02], 
[-4.55702145e-03 -9.93917069e-01 -1.10036786e-01  1.74095459e+02],
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00]]

When I compute the relative transformation matrix and decompose it into translation values and Euler angles (xyz order), I get the values below.

translation = [ 50.96591467  12.85551263 -29.01165249]
angles_in_degree = [3.13384302 0.50048769 5.44720086]

Can you please specify which direction of motion each of the above values represents?

All the scales and coordinates are defined within the game; I am not sure they are in meters. But the scales in the poses and depth images are definitely consistent with each other.
The pose is the camera pose in the world frame.

Also, how did you calculate the "relative transformation"? Your result does not seem to be right.

Oh great! Thanks. I think I first need to convert the camera pose from world coordinates to camera coordinates. Is it documented anywhere how to do this? I just need to know how the world coordinates are defined.

The exact world coordinate frame is defined within the game engine; I am not sure why it is important for your application.
The pose in the dataset is the "camera pose in the world frame"; all we need are consistent relative camera poses, right?

I'm warping the images based on the following equation

If p1 is the location of a point in frame1 and p2 is where it has moved to in frame2, the 3D point in camera1's coordinates is q1 = d(p1) * K^-1 * p1, where d(p1) is the depth at p1. To convert it to camera2's coordinates, we compute q2 = R*q1 + T, where R and T are the rotation and translation between camera1 and camera2. Finally, we get the reprojected point p2 by computing K * q2 and dividing by the 3rd coordinate.

For this, R and T should not be in world coordinates, right?
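For concreteness, here is a minimal numpy sketch of that equation for a single pixel (the function name and arguments are placeholders of mine, not from the dataset; R and T are assumed to already be the relative rotation and translation from camera1 to camera2):

import numpy as np

def reproject_pixel(x, y, depth1, K, R, T):
    # p1: homogeneous pixel coordinates in frame1
    p1 = np.array([x, y, 1.0])
    # q1 = d(p1) * K^-1 * p1: 3D point in camera1's coordinates
    q1 = depth1[y, x] * (np.linalg.inv(K) @ p1)
    # q2 = R*q1 + T: the same point in camera2's coordinates
    q2 = R @ q1 + T
    # p2 = K * q2, then divide by the 3rd coordinate
    p2 = K @ q2
    return p2[:2] / p2[2]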

I'm also adding two related pieces of information, in case they help:

  1. My colleague rendered videos from Blender. Pulling the camera pose from Blender gives it in world coordinates. She used the following code to transform it to camera coordinates, so that we can warp using the above equation.
from mathutils import Matrix, Vector
import numpy

def get_4x4_RT_matrix_from_blender(cam):
    # bcam stands for blender camera
    R_bcam2cv = Matrix(
        ((1,  0,  0),
         (0, -1,  0),
         (0,  0, -1)))

    translation = cam.location
    angles = cam.rotation_euler
    R_world2bcam = angles.to_matrix().transposed()
    T_world2bcam = -1 * R_world2bcam @ translation

    # Build the coordinate transform matrix from world to computer vision camera
    R_world2cv = R_bcam2cv @ R_world2bcam
    T_world2cv = R_bcam2cv @ T_world2bcam
    filler_row = Vector((0, 0, 0, 1))

    # assemble the 4x4 matrix
    RT = Matrix((
        R_world2cv[0][:] + (T_world2cv[0],),
        R_world2cv[1][:] + (T_world2cv[1],),
        R_world2cv[2][:] + (T_world2cv[2],),
        filler_row
        ))

    RT = numpy.ravel(numpy.asarray(RT))  # flatten to a length-16 array
    return RT

So, I think it is necessary to convert the camera pose from world coordinates to camera coordinates. Intuitively, if the camera moves to the left in world coordinates, objects in the image move to the right, so the sign of the translation values has to be reversed, right? Similarly for the rotation matrix. (See the pose-inversion sketch after this list.)

  2. I also tried the MVS Synth database. The poses given there worked directly: I could warp using the equations I described above and it worked seamlessly.
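To make the point in item 1 concrete, here is a minimal numpy sketch (my own illustration, not from either codebase) of converting a camera-to-world pose into the world-to-camera extrinsic matrix; the sign flip in the translation comes from the -R^T t term:

import numpy as np

def world_to_camera(cam_pose):
    # cam_pose: 4x4 camera pose in the world frame (camera-to-world)
    R = cam_pose[:3, :3]
    t = cam_pose[:3, 3]
    extrinsic = np.eye(4)
    extrinsic[:3, :3] = R.T          # inverse rotation
    extrinsic[:3, 3] = -R.T @ t      # translation sign is flipped here
    return extrinsic                 # same as np.linalg.inv(cam_pose) for a rigid transform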

"Also, how did you calculate the 'relative transformation'? Your result does not seem to be right."

I used the code below:

import numpy
from scipy.spatial.transform import Rotation

transformation = numpy.matmul(transformation2, numpy.linalg.inv(transformation1))
translation_vec = transformation[:3, 3]
rotation_mat = transformation[:3, :3]
rotation_obj = Rotation.from_matrix(rotation_mat)
angles = rotation_obj.as_euler('xyz', degrees=True)

where transformation1 and transformation2 are the matrices I posted in my second comment here.

Also, for warping, I don't explicitly convert the values into rotation and translation. I just converted them here so that it might be easier to understand the motion. For my warping, I directly use the transformation matrix computed above.

"For this, R and T should not be in world coordinates, right?"
R and T should be the relative pose of the two cameras.
Try the following:
import numpy as np
from numpy.linalg import inv

cam0_pose = np.array(
    [[ 8.33396038e-01,  6.57202359e-02, -5.48754858e-01,  8.58303600e+00],
     [ 5.51991914e-01, -1.48374374e-01,  8.20542486e-01,  4.93395081e+02],
     [-2.74949125e-02, -9.86745102e-01, -1.59931654e-01,  1.76480530e+02],
     [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  1.00000000e+00]])

cam1_pose = np.array(
    [[ 7.77160474e-01,  6.57259766e-02, -6.25860762e-01,  1.54254060e+01],
     [ 6.29285969e-01, -8.83683018e-02,  7.72133545e-01,  4.94666809e+02],
     [-4.55702145e-03, -9.93917069e-01, -1.10036786e-01,  1.74095459e+02],
     [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  1.00000000e+00]])

cam1_to_cam0 = np.matmul(inv(cam0_pose), cam1_pose)

I tried using the above relative transformation; the warped image is shown below. It does not match.

(attached images: frame1, frame2, frame2_warped)

I'll find some time to look at the data within this week.

Cool. This goes without saying, still, I really appreciate the time you're taking to help me out :)

Also, if it helps, my code is based on the code here

Adding the part of my code that computes the transformed coordinates of the points:

    # Inputs (defined elsewhere in my code): transformation1, transformation2 (4x4 camera poses),
    # intrinsic (3x3 camera matrix), depth1 (h x w depth map of frame1), h, w (image size)
    import numpy

    transformation = numpy.matmul(transformation2, numpy.linalg.inv(transformation1))

    y1d = numpy.array(range(h))
    x1d = numpy.array(range(w))
    x2d, y2d = numpy.meshgrid(x1d, y1d)
    ones_2d = numpy.ones(shape=(h, w))
    ones_4d = ones_2d[:, :, None, None]
    pos_vectors_homo = numpy.stack([x2d, y2d, ones_2d], axis=2)[:, :, :, None]

    intrinsic_inv = numpy.linalg.inv(intrinsic)
    intrinsic_4d = intrinsic[None, None]
    intrinsic_inv_4d = intrinsic_inv[None, None]
    depth_4d = depth1[:, :, None, None]
    trans_4d = transformation[None, None]

    unnormalized_pos = numpy.matmul(intrinsic_inv_4d, pos_vectors_homo)
    unit_pos_vecs = unnormalized_pos / numpy.linalg.norm(unnormalized_pos, axis=2, keepdims=True)
    world_points = depth_4d * unit_pos_vecs
    world_points_homo = numpy.concatenate([world_points, ones_4d], axis=2)
    trans_world_homo = numpy.matmul(trans_4d, world_points_homo)
    trans_world = trans_world_homo[:, :, :3]
    trans_norm_points = numpy.matmul(intrinsic_4d, trans_world)
    trans_pos = trans_norm_points[:, :, :2, 0] / trans_norm_points[:, :, 2:3, 0]  # transformed positions

Try this one. I found that it can associate the correct points.

import h5py 
import cv2
import numpy as np

sample_path = "20190124_203632.hdf5"
h5file = h5py.File(sample_path,"r")
img_num = int(len(h5file.keys()) / 4)
print("the sequence has %d images"%img_num)

img0 = cv2.imdecode(h5file["image_0"][:],cv2.IMREAD_COLOR)
K0 = h5file["K_0"][:]
pose0 = h5file["pose_0"][:]
depth0 = h5file["depth_0"][:]
print(K0)
print(pose0.shape)
print(depth0.shape)

img1 = cv2.imdecode(h5file["image_1"][:],cv2.IMREAD_COLOR)
K1 = h5file["K_1"][:]
pose1 = h5file["pose_1"][:]
depth1 = h5file["depth_1"][:]

u = 300
v = 240
pixel = np.array([u, v, 1.0])
Kinv = np.linalg.inv(K0)
# back-project pixel (u, v) into 3D using its depth, in camera0's frame
pixel_dir = np.matmul(Kinv, pixel) * depth0[v, u]
# relative pose taking camera0 coordinates into camera1 coordinates
relative_pose = np.matmul(np.linalg.inv(pose1), pose0)
transfered_point = np.matmul(relative_pose[:3,:3], pixel_dir) + relative_pose[:3,3]
# project into frame1, then divide by the 3rd coordinate
transfered_pixel = np.matmul(K0, transfered_point)
u_ = int(transfered_pixel[0] / transfered_point[2])
v_ = int(transfered_pixel[1] / transfered_point[2])

cv2.circle(img0, (u,v), 3, (0,0,255), -1)
cv2.circle(img1, (u_,v_), 3, (0,0,255), -1)

cv2.imshow("img0", img0)
cv2.imshow("img1", img1)
cv2.waitKey(0)

This works. Thanks a lot! I'm not sure what I did wrong; I'll debug by comparing against this and figure it out.

Okay, I fixed my warping code. A few errors/differences (for future users):

  1. The relative transformation is inv(T1) @ T0, not T1 @ inv(T0).
  2. Do not normalize the homogeneous pixel coordinates (normalizing usually gives better results for Blender-rendered frames, but not here).
  3. Do not multiply the depth by 10; the values can be used directly.
  4. Do not correct the [0,0] and [1,1] entries of the camera matrix as suggested for the MVS Synth low-res database.
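For future users, here is a minimal sketch of the corrected dense warp incorporating the points above (it assumes pose0/pose1 are the 4x4 camera-to-world poses, depth0 the h x w depth map of frame0, and K the 3x3 intrinsic matrix from the dataset; the function and variable names are mine):

import numpy as np

def warp_coordinates(depth0, K, pose0, pose1):
    # 1. relative transform from camera0 coordinates to camera1 coordinates: inv(T1) @ T0
    relative = np.matmul(np.linalg.inv(pose1), pose0)

    h, w = depth0.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([x, y, np.ones_like(x)], axis=-1).reshape(-1, 3).T  # 3 x (h*w)

    # 2./3. back-project without normalizing the rays, using the depth values directly
    points0 = np.matmul(np.linalg.inv(K), pixels) * depth0.reshape(1, -1)
    points0_h = np.vstack([points0, np.ones((1, points0.shape[1]))])

    points1 = np.matmul(relative, points0_h)[:3]
    # 4. project with the camera matrix as given, then divide by the 3rd coordinate
    projected = np.matmul(K, points1)
    uv = projected[:2] / projected[2:3]
    return uv.T.reshape(h, w, 2)  # target pixel coordinates in frame1 for each pixel of frame0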