PointCloud Align Problem
ZakeyShi opened this issue · 5 comments
Hi 👋, thanks for your great work!
Recently, I encountered some issues while using the Hypersim dataset.
I want to obtain the intrinsic parameters and pose of the camera for each image, in order to reconstruct the 3D scene using the depth ground truth.
- intrinsic: a 3x3 array.
  - $f_x$, $f_y$ are the focal lengths (in pixels).
  - $c_x$ and $c_y$ are the principal point coordinates (in pixels).
- pose: a 4x4 array. This pose matrix $T$ represents the transformation from the camera coordinate system to the world coordinate system.
  - $R_{3\times3}$ is the rotation matrix.
  - $t_{3\times1}$ is the translation vector.
  - $r_{ij}$ are the elements of the rotation matrix.
  - $t_x$, $t_y$, $t_z$ are the elements of the translation vector.
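Written out explicitly, these are the standard pinhole intrinsic matrix and rigid-body pose:

$$
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},
\qquad
T = \begin{bmatrix} R_{3\times3} & t_{3\times1} \\ 0_{1\times3} & 1 \end{bmatrix}
  = \begin{bmatrix}
      r_{11} & r_{12} & r_{13} & t_x \\
      r_{21} & r_{22} & r_{23} & t_y \\
      r_{31} & r_{32} & r_{33} & t_z \\
      0 & 0 & 0 & 1
    \end{bmatrix}.
$$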
Before asking this question, I searched related issues such as #9, #73, and #72, and also carefully read the README documentation.

For the camera intrinsic matrix, I used the `metadata_camera_parameters` file. It appears that whether it's `cam_00` or `cam_01`, the same camera intrinsics are used for a given `scene_name`. Specifically, I performed the conversion using the following code:
```python
import numpy as np
import pandas as pd

df_camera_parameters = pd.read_csv(metadata_path, index_col="scene_name")
df_ = df_camera_parameters.loc[scene_name]

width_pixels = int(df_["settings_output_img_width"])
height_pixels = int(df_["settings_output_img_height"])

# Scene-specific OpenGL projection matrix.
M_proj = np.array(
    [
        [df_["M_proj_00"], df_["M_proj_01"], df_["M_proj_02"], df_["M_proj_03"]],
        [df_["M_proj_10"], df_["M_proj_11"], df_["M_proj_12"], df_["M_proj_13"]],
        [df_["M_proj_20"], df_["M_proj_21"], df_["M_proj_22"], df_["M_proj_23"]],
        [df_["M_proj_30"], df_["M_proj_31"], df_["M_proj_32"], df_["M_proj_33"]],
    ]
)

# Matrix to map to integer screen coordinates from normalized device coordinates.
M_screen_from_ndc = np.matrix(
    [
        [0.5 * (width_pixels - 1), 0, 0, 0.5 * (width_pixels - 1)],
        [0, -0.5 * (height_pixels - 1), 0, 0.5 * (height_pixels - 1)],
        [0, 0, 0.5, 0.5],
        [0, 0, 0, 1.0],
    ]
)

# Extract fx, fy, cx, and cy, and build a corresponding intrinsic matrix.
screen_from_cam = M_screen_from_ndc @ M_proj
fx = abs(screen_from_cam[0, 0])
fy = abs(screen_from_cam[1, 1])
cx = abs(screen_from_cam[0, 2])
cy = abs(screen_from_cam[1, 2])

K = np.eye(4, dtype=np.float32)
K[0, 0] = float(fx)
K[1, 1] = float(fy)
K[0, 2] = float(cx)
K[1, 2] = float(cy)
```
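As a quick, hypothetical sanity check (not part of the original post), you can project a point expressed in the OpenCV camera frame (x right, y down, z forward) through the extracted pinhole parameters and confirm it lands inside the image:

```python
p_cam = np.array([0.1, -0.2, 2.0])  # example point two units in front of the camera
u = fx * p_cam[0] / p_cam[2] + cx   # horizontal pixel coordinate
v = fy * p_cam[1] / p_cam[2] + cy   # vertical pixel coordinate
assert 0.0 <= u < width_pixels and 0.0 <= v < height_pixels
```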
For the camera pose calculation, I used the following script. It converts the translation vector to meters using the scale factor from the `metadata_scene.csv` file, and it also converts from the OpenGL camera convention to the OpenCV camera convention:
```python
import os
from pathlib import Path

import h5py
import numpy as np
import pandas as pd


def load_world_T_cam(root_dir, scene_name, camera_name, frame_id):
    # Illustrative wrapper: the snippet below is the body of a function like this.
    camera_dir = os.path.join(scene_name, "_detail", camera_name)
    camera_positions_hdf5_file = os.path.join(root_dir, camera_dir, "camera_keyframe_positions.hdf5")
    camera_orientations_hdf5_file = os.path.join(root_dir, camera_dir, "camera_keyframe_orientations.hdf5")

    with h5py.File(camera_positions_hdf5_file, "r") as f:
        camera_positions = f["dataset"][int(frame_id)]
    with h5py.File(camera_orientations_hdf5_file, "r") as f:
        camera_orientations = f["dataset"][int(frame_id)]

    # Get scale factor from asset coordinates to meters.
    scene_metadata = pd.read_csv(Path(root_dir) / scene_name / "_detail" / "metadata_scene.csv")
    meters_per_asset_unit = scene_metadata[scene_metadata.parameter_name == "meters_per_asset_unit"]
    assert len(meters_per_asset_unit) == 1  # should not be multiply defined
    scale_factor = float(meters_per_asset_unit.parameter_value.iloc[0])

    # Build the camera-to-world matrix (translation converted from asset units to meters).
    R_world_from_cam = camera_orientations
    t_world_from_cam = np.array(camera_positions).T
    M_world_from_cam = np.zeros((4, 4))
    M_world_from_cam[:3, :3] = R_world_from_cam
    M_world_from_cam[:3, 3] = t_world_from_cam * scale_factor
    M_world_from_cam[3, 3] = 1.0
    world_T_cam = M_world_from_cam.astype(np.float32)

    # Flip axis signs to convert from the OpenGL camera convention to the OpenCV camera convention.
    gl_to_cv = np.array([[1, -1, -1, 1], [-1, 1, 1, -1], [-1, 1, 1, -1], [1, 1, 1, 1]])
    world_T_cam *= gl_to_cv

    # Rotate the world frame by -90 degrees about the x-axis (rotx is a rotation helper, see below).
    rot_mat = rotx(-np.pi / 2) @ world_T_cam[:3, :3]
    trans = rotx(-np.pi / 2) @ world_T_cam[:3, 3]
    world_T_cam[:3, :3] = rot_mat
    world_T_cam[:3, 3] = trans

    cam_T_world = np.linalg.inv(world_T_cam)
    return world_T_cam, cam_T_world
```
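The snippet above calls a `rotx` helper that isn't shown; presumably it builds a rotation about the x-axis. A minimal version consistent with that usage would be:

```python
import numpy as np

def rotx(theta):
    """3x3 rotation matrix about the x-axis by angle theta (in radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [1.0, 0.0, 0.0],
        [0.0,   c,  -s],
        [0.0,   s,   c],
    ])
```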
As for the depth ground truth, I processed it according to #9.
However, I still cannot align the 3D point clouds reconstructed from multiple depth images 😭. The results are shown in the attached screenshots.
I am a SLAMer and not very familiar with rendering-related knowledge. The above question might be naive, but I would be incredibly grateful if you could help me solve this problem.
Hi @ZakeyShi. Great question. I'll go through it in detail, but I just wanted to quickly say that I appreciate the care and attention to detail in your question.
The easiest way to construct a ground truth point cloud is not to use our `depth` images, but instead to use our `position` images, because our `position` images already contain world-space positions and can be trivially concatenated together to form a point cloud. Even if you want to ultimately use `depth` images, I would start with `position` images to make sure the rest of your code is behaving as you expect. If this works for your application, you can stop here.
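For concreteness, a rough sketch of that approach could look like the following. The file layout (`<scene>/images/scene_<camera>_geometry_hdf5/frame.NNNN.position.hdf5`) and the assumption that `position` images are stored in asset units should be verified against your download; treat this as an illustration rather than a reference implementation:

```python
import glob
import os

import h5py
import numpy as np


def position_images_to_point_cloud(scene_dir, camera_name, meters_per_asset_unit):
    """Concatenate per-frame position images into one world-space point cloud (in meters)."""
    pattern = os.path.join(
        scene_dir, "images", f"scene_{camera_name}_geometry_hdf5", "frame.*.position.hdf5"
    )
    points = []
    for position_file in sorted(glob.glob(pattern)):
        with h5py.File(position_file, "r") as f:
            position = f["dataset"][:].reshape(-1, 3)  # world-space positions in asset units
        points.append(position[np.isfinite(position).all(axis=-1)])  # drop invalid pixels
    return np.concatenate(points, axis=0) * meters_per_asset_unit
```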
If you want to use our `depth` images to construct a ground truth point cloud (e.g., because your application requires you to define an image formation model in terms of the `depth` images or something), this is also possible, but it requires a couple of extra steps. Recall that our `depth_meters` images should really be called `distance_meters` images, because they contain Euclidean distances along camera rays, not planar depths, as discussed in #9. So, in order to construct a globally aligned point cloud, you need to compute the appropriate camera ray at each pixel location (based on the scene-specific camera parameters), and then use each `depth` value as a distance along this ray to determine where the surface point should be.

We provide code for computing all the necessary rays here. If you compute rays using this example code, and interpret `depth` values as distances along these rays, it should be straightforward to obtain a globally aligned point cloud for each scene. In fact, if you convert the `depth` values (actually distance values) from meters into asset units using the `meters_per_asset_unit` parameter discussed in #10, then you should obtain world-space positions that exactly match our `position` images.
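If it helps, here is a rough sketch of that back-projection under a plain pinhole model (using the fx, fy, cx, cy extracted earlier and the OpenCV camera convention). The linked example code, which computes rays directly from the scene-specific camera parameters, is the one to trust if the two disagree:

```python
import numpy as np


def distance_to_world_points(distance_meters, fx, fy, cx, cy, world_T_cam):
    """Back-project a Hypersim distance image (Euclidean distance along each camera ray,
    in meters) into world-space points, assuming a simple pinhole camera (OpenCV convention)."""
    h, w = distance_meters.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Unit-length ray direction through each pixel, in the camera frame.
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=np.float64)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Scale each ray by its Euclidean distance to get camera-space points.
    points_cam = rays * distance_meters[..., None]
    # Transform into world space with the camera-to-world pose.
    R, t = world_T_cam[:3, :3], world_T_cam[:3, 3]
    return points_cam.reshape(-1, 3) @ R.T + t
```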
Hi @mikeroberts3000! Thank you for your quick and detailed reply, which has been very enlightening for me.
I discovered that there was an error in my point cloud visualization code on my end. Specifically, I stored the `depthGT` in millimeters, but when visualizing the point cloud, I mistakenly applied the pose matrix directly, without first converting the `depthGT` from millimeters to meters. This caused an inconsistency between the units of the `depthGT` (millimeters) and the translation vector (meters).
After correcting this careless bug, the point cloud visualization is now working correctly 🎉.
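For anyone who runs into the same thing, the fix is just a unit conversion before applying the pose, roughly:

```python
depth_gt_meters = depth_gt_mm / 1000.0  # hypothetical variable names; keep depth and translation both in meters
```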
Just one more question: can I directly use the list in #22 (comment) to skip these broken scenes? I'm not sure whether that list comprehensively covers all potentially broken scenes.
Or should I write my own logic to filter out other potentially broken scenes? If so, do you have any good suggestions? 😊
That point cloud looks great! I'm happy you figured out your bug.
Our standard split file (`evermotion_dataset/analysis/metadata_images_split_scene_v1.csv`) does a pretty good job of filtering out broken scenes and camera trajectories, and provides a short explanation for each filtering decision. But the problematic scenes and trajectories in #22 slipped through our filtering procedure.
If you filter based on our standard split file and #22, I believe that is sufficient to exclude all broken scenes. Please post in #22 if you find any other broken scenes. And if you're not using our standard split, please be sure to document whatever additional filtering steps you're doing in your papers.
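As a rough sketch of that filtering in pandas (the column names below are assumptions, so please verify them against the actual CSV header before relying on this):

```python
import pandas as pd

split_csv = "evermotion_dataset/analysis/metadata_images_split_scene_v1.csv"
split_df = pd.read_csv(split_csv)

# Assumed column names: scene_name, camera_name, frame_id, included_in_public_release.
broken_scenes = set()  # fill in with the scene names collected in #22

valid_images = split_df[
    split_df["included_in_public_release"]
    & ~split_df["scene_name"].isin(broken_scenes)
]
print(valid_images[["scene_name", "camera_name", "frame_id"]].head())
```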
Hey @ZakeyShi, thank you for your well-formulated question!
I am trying to get the same output as you: intrinsics (3x3), poses (4x4), and the depth map.
But I am struggling to get the views aligned. Do you mind sharing your entire code for the PCL generation?
My code for loading the camera parameters and depth map is here: gist.
(screenshots: input views and output PCL)
@thibble instead of attempting to obtain someone else's end-to-end solution to solve your problem, I recommend the following. Slow down, read through this discussion carefully, study the example code that is already linked, and write a well-formulated question of your own if you're still stuck.