PointCloud Align Problem
ZakeyShi opened this issue · 5 comments
Hi 👋, thanks for your great work!
Recently, I encountered some issues while using the Hypersim dataset.
I want to obtain the intrinsic parameters and pose of the camera for each image, in order to reconstruct the 3D scene using the depth ground truth.
- intrinsic: a 3x3 array.
  - $f_x$, $f_y$ are the focal lengths (in pixels).
  - $c_x$ and $c_y$ are the principal point coordinates (in pixels).
- pose: a 4x4 array. This pose matrix $T$ represents the transformation from the camera coordinate system to the world coordinate system.
  - $R_{3\times3}$ is the rotation matrix.
  - $t_{3\times1}$ is the translation vector.
  - $r_{ij}$ are the elements of the rotation matrix.
  - $t_x$, $t_y$, $t_z$ are the elements of the translation vector.
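Written out explicitly, these are the standard pinhole intrinsic matrix and rigid-body pose:

$$
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},
\qquad
T = \begin{bmatrix} R_{3\times3} & t_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix}
  = \begin{bmatrix}
      r_{11} & r_{12} & r_{13} & t_x \\
      r_{21} & r_{22} & r_{23} & t_y \\
      r_{31} & r_{32} & r_{33} & t_z \\
      0 & 0 & 0 & 1
    \end{bmatrix}.
$$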
Before asking this question, I searched related issues such as #9, #73, and #72, and also carefully read the README documentation.

For the camera intrinsic matrix, I used the `metadata_camera_parameters` file. It appears that whether it's `cam_00` or `cam_01`, the same camera intrinsics are used for a given `scene_name`. Specifically, I performed the conversion using the following code:
```python
import numpy as np
import pandas as pd

df_camera_parameters = pd.read_csv(metadata_path, index_col="scene_name")
df_ = df_camera_parameters.loc[scene_name]

width_pixels = int(df_["settings_output_img_width"])
height_pixels = int(df_["settings_output_img_height"])

# Scene-specific OpenGL projection matrix.
M_proj = np.array(
    [
        [df_["M_proj_00"], df_["M_proj_01"], df_["M_proj_02"], df_["M_proj_03"]],
        [df_["M_proj_10"], df_["M_proj_11"], df_["M_proj_12"], df_["M_proj_13"]],
        [df_["M_proj_20"], df_["M_proj_21"], df_["M_proj_22"], df_["M_proj_23"]],
        [df_["M_proj_30"], df_["M_proj_31"], df_["M_proj_32"], df_["M_proj_33"]],
    ]
)

# Matrix to map to integer screen coordinates from normalized device coordinates.
M_screen_from_ndc = np.matrix(
    [
        [0.5 * (width_pixels - 1), 0, 0, 0.5 * (width_pixels - 1)],
        [0, -0.5 * (height_pixels - 1), 0, 0.5 * (height_pixels - 1)],
        [0, 0, 0.5, 0.5],
        [0, 0, 0, 1.0],
    ]
)

# Extract fx, fy, cx, and cy, and build a corresponding intrinsic matrix.
screen_from_cam = M_screen_from_ndc @ M_proj
fx = abs(screen_from_cam[0, 0])
fy = abs(screen_from_cam[1, 1])
cx = abs(screen_from_cam[0, 2])
cy = abs(screen_from_cam[1, 2])

K = np.eye(4, dtype=np.float32)
K[0, 0] = float(fx)
K[1, 1] = float(fy)
K[0, 2] = float(cx)
K[1, 2] = float(cy)
```
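As a quick, hypothetical sanity check (not part of the original post), you can project a point expressed in the OpenCV camera frame (x right, y down, z forward) through the extracted pinhole parameters and confirm it lands inside the image:

```python
p_cam = np.array([0.1, -0.2, 2.0])  # example point two units in front of the camera
u = fx * p_cam[0] / p_cam[2] + cx   # horizontal pixel coordinate
v = fy * p_cam[1] / p_cam[2] + cy   # vertical pixel coordinate
assert 0.0 <= u < width_pixels and 0.0 <= v < height_pixels
```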
For the camera pose calculation, I used the following script. It converts the translation vector to meters using the scale factor from the `metadata_scene.csv` file, and it also converts from the OpenGL camera convention to the OpenCV camera convention:
```python
import os
from pathlib import Path

import h5py
import numpy as np
import pandas as pd


def load_world_T_cam(root_dir, scene_name, camera_name, frame_id):
    # Illustrative wrapper: the snippet below is the body of a function like this.
    camera_dir = os.path.join(scene_name, "_detail", camera_name)
    camera_positions_hdf5_file = os.path.join(root_dir, camera_dir, "camera_keyframe_positions.hdf5")
    camera_orientations_hdf5_file = os.path.join(root_dir, camera_dir, "camera_keyframe_orientations.hdf5")

    with h5py.File(camera_positions_hdf5_file, "r") as f:
        camera_positions = f["dataset"][int(frame_id)]
    with h5py.File(camera_orientations_hdf5_file, "r") as f:
        camera_orientations = f["dataset"][int(frame_id)]

    # Get scale factor from asset coordinates to meters.
    scene_metadata = pd.read_csv(Path(root_dir) / scene_name / "_detail" / "metadata_scene.csv")
    meters_per_asset_unit = scene_metadata[scene_metadata.parameter_name == "meters_per_asset_unit"]
    assert len(meters_per_asset_unit) == 1  # should not be multiply defined
    scale_factor = float(meters_per_asset_unit.parameter_value.iloc[0])

    # Build the camera-to-world matrix (translation converted from asset units to meters).
    R_world_from_cam = camera_orientations
    t_world_from_cam = np.array(camera_positions).T
    M_world_from_cam = np.zeros((4, 4))
    M_world_from_cam[:3, :3] = R_world_from_cam
    M_world_from_cam[:3, 3] = t_world_from_cam * scale_factor
    M_world_from_cam[3, 3] = 1.0
    world_T_cam = M_world_from_cam.astype(np.float32)

    # Flip axis signs to convert from the OpenGL camera convention to the OpenCV camera convention.
    gl_to_cv = np.array([[1, -1, -1, 1], [-1, 1, 1, -1], [-1, 1, 1, -1], [1, 1, 1, 1]])
    world_T_cam *= gl_to_cv

    # Rotate the world frame by -90 degrees about the x-axis (rotx is a rotation helper, see below).
    rot_mat = rotx(-np.pi / 2) @ world_T_cam[:3, :3]
    trans = rotx(-np.pi / 2) @ world_T_cam[:3, 3]
    world_T_cam[:3, :3] = rot_mat
    world_T_cam[:3, 3] = trans

    cam_T_world = np.linalg.inv(world_T_cam)
    return world_T_cam, cam_T_world
```
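The snippet above calls a `rotx` helper that isn't shown; presumably it builds a rotation about the x-axis. A minimal version consistent with that usage would be:

```python
import numpy as np

def rotx(theta):
    """3x3 rotation matrix about the x-axis by angle theta (in radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [1.0, 0.0, 0.0],
        [0.0,   c,  -s],
        [0.0,   s,   c],
    ])
```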
As for the depth ground truth, I processed it according to #9.
However, I still cannot align the 3D point clouds reconstructed from multiple depth images 😭. The results are shown in the attached screenshots.
I am a SLAMer and not very familiar with rendering-related knowledge. The above question might be naive, but I would be incredibly grateful if you could help me solve this problem.
Hi @ZakeyShi. Great question. I'll go through it in detail, but I just wanted to quickly say that I appreciate the care and attention to detail in your question.
The easiest way to construct a ground truth point cloud is not to use our `depth` images, but instead to use our `position` images, because our `position` images already contain world-space positions and can be trivially concatenated together to form a point cloud. Even if you want to ultimately use `depth` images, I would start with `position` images to make sure the rest of your code is behaving as you expect. If this works for your application, you can stop here.
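For concreteness, a rough sketch of that approach could look like the following. The file layout (`<scene>/images/scene_<camera>_geometry_hdf5/frame.NNNN.position.hdf5`) and the assumption that `position` images are stored in asset units should be verified against your download; treat this as an illustration rather than a reference implementation:

```python
import glob
import os

import h5py
import numpy as np


def position_images_to_point_cloud(scene_dir, camera_name, meters_per_asset_unit):
    """Concatenate per-frame position images into one world-space point cloud (in meters)."""
    pattern = os.path.join(
        scene_dir, "images", f"scene_{camera_name}_geometry_hdf5", "frame.*.position.hdf5"
    )
    points = []
    for position_file in sorted(glob.glob(pattern)):
        with h5py.File(position_file, "r") as f:
            position = f["dataset"][:].reshape(-1, 3)  # world-space positions in asset units
        points.append(position[np.isfinite(position).all(axis=-1)])  # drop invalid pixels
    return np.concatenate(points, axis=0) * meters_per_asset_unit
```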
If you want to use our `depth` images to construct a ground truth point cloud (e.g., because your application requires you to define an image formation model in terms of the `depth` images or something), this is also possible, but it requires a couple of extra steps. Recall that our `depth_meters` images should really be called `distance_meters` images, because they contain Euclidean distances along camera rays, not planar depths, as discussed in #9. So, in order to construct a globally aligned point cloud, you need to compute the appropriate camera ray at each pixel location (based on the scene-specific camera parameters), and then use each `depth` value as a distance along this ray to determine where the surface point should be.

We provide code for computing all the necessary rays here. If you compute rays using this example code, and interpret `depth` values as distances along these rays, it should be straightforward to obtain a globally aligned point cloud for each scene. In fact, if you convert the `depth` values (actually distance values) from meters into asset units using the `meters_per_asset_unit` parameter discussed in #10, then you should obtain world-space positions that exactly match our `position` images.
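If it helps, here is a rough sketch of that back-projection under a plain pinhole model (using the fx, fy, cx, cy extracted earlier and the OpenCV camera convention). The linked example code, which computes rays directly from the scene-specific camera parameters, is the one to trust if the two disagree:

```python
import numpy as np


def distance_to_world_points(distance_meters, fx, fy, cx, cy, world_T_cam):
    """Back-project a Hypersim distance image (Euclidean distance along each camera ray,
    in meters) into world-space points, assuming a simple pinhole camera (OpenCV convention)."""
    h, w = distance_meters.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Unit-length ray direction through each pixel, in the camera frame.
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=np.float64)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Scale each ray by its Euclidean distance to get camera-space points.
    points_cam = rays * distance_meters[..., None]
    # Transform into world space with the camera-to-world pose.
    R, t = world_T_cam[:3, :3], world_T_cam[:3, 3]
    return points_cam.reshape(-1, 3) @ R.T + t
```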
Hi @mikeroberts3000! Thank you for your quick and detailed reply, which has been very enlightening for me.
I discovered that there was an error in my point cloud visualization code on my end. Specifically, I stored the `depthGT` in millimeters, but when visualizing the point cloud, I mistakenly applied the pose matrix directly, without first converting the `depthGT` from millimeters to meters. This caused an inconsistency between the units of the `depthGT` (millimeters) and the translation vector (meters).
After correcting this careless bug, the point cloud visualization is now working correctly 🎉.
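For anyone who runs into the same thing, the fix is just a unit conversion before applying the pose, roughly:

```python
depth_gt_meters = depth_gt_mm / 1000.0  # hypothetical variable names; keep depth and translation both in meters
```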
Just one more question: can I directly use the list in #22 (comment) to skip these broken scenes? I'm not sure whether that list comprehensively covers all potentially broken scenes.
Or should I write my own logic to filter out other potentially broken scenes? If so, do you have any good suggestions? 😊
That point cloud looks great! I'm happy you figured out your bug.
Our standard split file (`evermotion_dataset/analysis/metadata_images_split_scene_v1.csv`) does a pretty good job of filtering out broken scenes and camera trajectories, and provides a short explanation for each filtering decision. But the problematic scenes and trajectories in #22 slipped through our filtering procedure.
If you filter based on our standard split file and #22, I believe that is sufficient to exclude all broken scenes. Please post in #22 if you find any other broken scenes. And if you're not using our standard split, please be sure to document whatever additional filtering steps you're doing in your papers.
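As a rough sketch of that filtering in pandas (the column names below are assumptions, so please verify them against the actual CSV header before relying on this):

```python
import pandas as pd

split_csv = "evermotion_dataset/analysis/metadata_images_split_scene_v1.csv"
split_df = pd.read_csv(split_csv)

# Assumed column names: scene_name, camera_name, frame_id, included_in_public_release.
broken_scenes = set()  # fill in with the scene names collected in #22

valid_images = split_df[
    split_df["included_in_public_release"]
    & ~split_df["scene_name"].isin(broken_scenes)
]
print(valid_images[["scene_name", "camera_name", "frame_id"]].head())
```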
Hey @ZakeyShi, thank you for your well-formulated question!
I am trying to get the same output as you: intrinsics (3x3), poses (4x4), and the depth map.
But I am struggling to get the views aligned. Do you mind sharing your entire code for the PCL generation?
My code for loading the camera parameters and depth map is here: gist.
(screenshots: input views and output PCL)
@thibble instead of attempting to obtain someone else's end-to-end solution to solve your problem, I recommend the following. Slow down, read through this discussion carefully, study the example code that is already linked, and write a well-formulated question of your own if you're still stuck.