mgonzs13/yolov8_ros

IndexError / out of bounds in detect_3d_node using RealSense

adamwhats opened this issue · 13 comments

Hi there,

Firstly thanks for the great repo.

I'm having some issues while trying to use detect_3d_node.py: the convert_bb_to_3d function always raises an IndexError.

  File "/workspaces/yolov8_ws/src/yolov8_ros/yolov8_ros/yolov8_ros/detect_3d_node.py", line 91, in on_detections
    bbox3d = self.convert_bb_to_3d(points, detection)
  File "/workspaces/yolov8_ws/src/yolov8_ros/yolov8_ros/yolov8_ros/detect_3d_node.py", line 153, in convert_bb_to_3d
    center_point = points[int(center_y)][int(center_x)]
IndexError: index 263 is out of bounds for axis 0 with size 1

Looking at the line center_point = points[int(center_y)][int(center_x)], and also at line 83 where the points array is constructed, it seems the code expects point_msg.height and point_msg.width to correspond to the size of the input image used by the detection node. Unfortunately, the depth camera / ROS 2 driver I'm using (RealSense D435) constructs the PointCloud2 message with a height of 1, so the points array ends up with a shape of [1, num_points, 3].
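
To make the failure mode concrete, here's a minimal illustration (not the repo's exact code, just the shape problem) of why the pixel lookup fails on a cloud like that:

import numpy as np

# An unorganized RealSense cloud reshaped as (height, width, 3):
# height == 1, width == number of valid points
points = np.zeros((1, 100000, 3), dtype=np.float32)

# Pixel coordinates of the 2D bbox center from the detection node
center_x, center_y = 320, 263

# Axis 0 only has size 1, so any center_y > 0 raises
# "IndexError: index 263 is out of bounds for axis 0 with size 1"
center_point = points[int(center_y)][int(center_x)]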

I'm happy to work on a PR to fix this if you'd like. I feel that assuming the PointCloud2 message's width and height match those of the colour image isn't the most robust solution. One approach I've used in the past is to calculate the 3D bounding box by applying the segmentation mask to the depth image rather than to the pointcloud, but I'm also happy to try something else if you have any suggestions.

Adam

Hi @adamwhats, thanks for the comment. The easiest way to fix it may be to resize the points array to the image size. However, the use of the depth image is interesting. I have never used it; can it speed up the 3D bounding box creation?

Thanks for getting back to me. I don't think resizing the pointcloud will work for me, unfortunately 😕 The RealSense seems to only publish valid points, meaning that the pointcloud message size is constantly changing, so any attempt to resize it to the image size will fail.

I've written a version of convert_bb_to_3d which works by first segmenting the depth image and then using the depth camera_info message to project it out to 3D. I could probably optimize it a little more, but it currently takes about 11-12 ms on my PC. Unfortunately I can't compare that with your original method due to this issue with the camera; however, if you could share a rosbag from your camera, I'd be happy to do a bit more testing to see which approach is faster.
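
For context, the core of that approach looks roughly like this (a rough sketch rather than my exact implementation; it assumes a depth image aligned to the colour frame, the flat row-major intrinsics k from the depth camera_info, and a bbox given in pixel coordinates):

import numpy as np

def bbox_depth_to_3d(depth_image, k, x_min, y_min, x_max, y_max,
                     depth_units_divisor=1000.0):
    # Intrinsics from the flat, row-major camera_info.k
    fx, fy, cx, cy = k[0], k[4], k[2], k[5]

    # Crop the depth image to the bbox and convert to meters
    z = depth_image[y_min:y_max, x_min:x_max].astype(np.float32) / depth_units_divisor

    # Pixel grid for the cropped region (v = column/x, u = row/y)
    v, u = np.meshgrid(np.arange(x_min, x_max), np.arange(y_min, y_max))

    # Pinhole back-projection into the camera frame
    x = z * (v - cx) / fx
    y = z * (u - cy) / fy
    points_3d = np.dstack([x, y, z]).reshape(-1, 3)

    # Drop invalid (zero-depth) pixels
    return points_3d[points_3d[:, 2] > 0]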

Out of interest, which camera have you been using?

I am using an Asus Xtion, which is the camera inside the robots in our lab (TIAGo and RB1). I think I have run into the same problem with the RealSense before, but I don't remember how I fixed it. Maybe changing some parameters in the launch file. I'll try to record some bags this week. You may use a simulated robot; I think the RGBD camera should behave like the Asus I am using.

Here you have a bag. It is the one I have used in the videos of the README.

Appreciate you sharing that, but unfortunately there isn't a pointcloud topic in that one, so I can't do a side-by-side comparison of the methods 😕


I just have a small question about the max_detection_threshold parameter to make sure I'm implementing things correctly. Is the expected behavior that any detections beyond the threshold do not get a 3D bounding box?

The maximum_detection_threshold is the maximum distance along the z-axis, measured from the center point of the 3D object, for a point to be included in the 3D bounding box.
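
In other words, something like this (a simplified sketch of the idea, not the exact code of the node):

import numpy as np

def filter_by_z(points_3d, center_z, maximum_detection_threshold):
    # Keep only the points whose z is within maximum_detection_threshold of the
    # center point's z; the 3D bbox is then the min/max extents of what is left
    mask = np.abs(points_3d[:, 2] - center_z) <= maximum_detection_threshold
    kept = points_3d[mask]
    return kept.min(axis=0), kept.max(axis=0)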

On the other hand, you can create the point cloud from the depth image using depth_image_proc. Here is the launch file I used:

import launch
import launch_ros.actions
import launch_ros.descriptions


def generate_launch_description():

    container = launch_ros.actions.ComposableNodeContainer(
        name='container',
        namespace='',
        package='rclcpp_components',
        executable='component_container',
        composable_node_descriptions=[
                # Create XYZRGB point cloud
                launch_ros.descriptions.ComposableNode(
                    package='depth_image_proc',
                    plugin='depth_image_proc::PointCloudXyzrgbNode',
                    name='points_xyzrgb',
                    parameters=[{'queue_size': 10}],
                    remappings=[('rgb/image_rect_color', '/xtion/rgb/image_raw'),
                                ('rgb/camera_info', '/xtion/rgb/camera_info'),
                                ('depth_registered/image_rect',
                                 '/xtion/depth/image_raw'),
                                ('points', '/xtion/depth_registered/points'), ],
                ),
        ],
        output='screen',
    )

    return launch.LaunchDescription([container])
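
This should give you an organized point cloud whose height and width match the registered depth image, which is what the pixel-indexed lookup in detect_3d_node expects.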

Thanks a lot for that. I've done a bit of testing and it looks like my current implementation is about 3x slower 😅 I've got a few ideas for improving it which I'll try to work on this evening.

Ok, I've had a chance to make some changes and it now runs at a comparable speed to your original approach. Here are my results:
[screenshot: benchmark results]
It's not the most sophisticated benchmarking: I simply played your rosbag three times for each implementation, timing how long the on_detections callback in detect_3d_node.py took.
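
Roughly, the timing looked like this (an illustrative sketch, not my exact code):

import time

def timed(fn):
    # Minimal decorator that logs how long a node callback takes, in milliseconds
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = fn(self, *args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.get_logger().info(f"{fn.__name__} took {elapsed_ms:.1f} ms")
        return result
    return wrapper

# hypothetical usage: decorate the callback in detect_3d_node
# @timed
# def on_detections(self, ...): ...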

Would you like me to put together a PR? There's not much improvement in terms of performance, but I think using the depth image rather than the pointcloud is more robust across different camera models. No worries if not, though.

Thanks for the work. I have a couple of questions: Have you tried this with the normal yolov8m model? Would you happen to know why your version is slower with yolov8_pose?

Apologies for the slow reply. In answer to your questions:

  1. Yes, this works with the normal model. I actually only use the bbox rather than the mask when processing the output from the segmentation model, meaning that detect_3d_node functions in the same way for both yolov8m and yolov8m-seg. As you can see below, this is why there's pretty much the same percentage improvement.

[screenshot: timing comparison]

  2. So when there is keypoint data, `detect_3d_node` runs both the `convert_bb_to_3d` and `convert_keypoints_to_3d` functions. The new implementation of `convert_bb_to_3d` is faster than the original, but the new `convert_keypoints_to_3d` is slower. This is because in your original implementation you are able to simply look up the 3D point using the keypoint coordinates:
for p in detection.keypoints.data:

    if int(p.point.y) >= points.shape[0] or int(p.point.x) >= points.shape[1]:
        continue

    p3d = points[int(p.point.y)][int(p.point.x)]

whereas in mine I have to do the extra maths of projecting to 3D:

# Build an array of 2d keypoints
keypoints_2d = np.array([[p.point.x, p.point.y] for p in detection.keypoints.data], dtype=np.int16)
u = keypoints_2d[:, 1].clip(0, depth_info.height - 1)
v = keypoints_2d[:, 0].clip(0, depth_info.width - 1)

# sample depth image and project to 3D
z = depth_image[u, v]
k = depth_info.k
px, py, fx, fy = k[2], k[5], k[0], k[4]
x = z * (v - px) / fx
y = z * (u - py) / fy
points_3d = np.dstack([x, y, z]).reshape(-1, 3) / self.depth_image_units_divisor  # convert to meters

All in all, the slowdown in convert_keypoints_to_3d outweighs the improvement in convert_bb_to_3d

Great work and thanks for the comparison. Can you make a PR with the changes? Remember to remove the logs and the timing code.

#13 raised, so closing the issue 😄