MIT-SPARK/Hydra

Semantic segmentation + running Hydra on custom data

feramozur opened this issue · 18 comments

First of all, thank you for releasing such an interesting work in the field of Semantic SLAM.

Considering that the Semantic Segmentation Network used by Hydra is not yet ready for publication, I'd like to ask if there are alternatives that can be considered in the meantime, including the use of other networks, or the supply of custom "ground truth" data (in that case I'd like to know which format Hydra expects).

Additionally, I was wondering if there are plans to add support for custom bag files/sequences, or alternatively adding instructions to do so, reutilizing the existing launch file infrastructure. We are very interested in testing Hydra with data recorded from our own robot, which is fitted with an RGBD camera, and it could also provide odometry information directly (nav_msgs/Odometry) instead of requiring a visual-inertial odometry component. It is also not presently clear to us which topics are needed as inputs to the system.

Thank you for your interest in our work!

On the choice of a 2D semantic segmentation network: We actually use an off-the-shelf network for 2D semantic segmentation as mentioned in our paper (we use HRNet from the MIT Scene-Parsing benchmark). The pre-trained pytorch models can be found here. Using other networks or sources of ground-truth semantics is also possible as long as you post-process the semantic labels and set up your configs correctly downstream.

Actually setting up 2D semantic segmentation for use with Hydra (and Kimera-Semantics): The code we haven't released yet just loads an ONNX model (which we exported from pytorch) into TensorRT, subscribes to an RGB image topic, runs inference on received RGB images, and then condenses and colors the resulting semantic label image. You could also do the same with pytorch directly (we just wanted to avoid python2 to python3 issues in 18.04). There are typically two steps when setting up a new segmentation network:

  1. Kimera-Semantics is limited to doing inference on 20 categories, so it's likely you will have to condense the full set of predicted labels (from the 2D segmentation network) down into 20 groups. You can view our mapping here for the pre-trained model that I pointed to earlier.
  2. We tend to convert each of the 20 condensed labels to a unique color to help with visualization (i.e. mapping label 0 to red, label 1 to green, etc.). Kimera-Semantics requires a CSV mapping colors to labels. An example CSV for the pre-trained network that we used is available here. Note that you could also just use the label as one of the channels, but that makes it hard to distinguish the different semantic classes in the mesh.

Once you have that, you just need to combine the depth and semantic image using this nodelet in depth_image_proc to obtain the semantic pointcloud that Hydra expects as input.
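
Not part of any released code, but to make the above concrete, here is a minimal, untested Python sketch of such a segmentation node using onnxruntime instead of TensorRT; the model path, topics, preprocessing, label grouping, and colors are all placeholders you would replace with your own:

#!/usr/bin/env python3
# Minimal sketch of a 2D semantic segmentation node for Hydra / Kimera-Semantics.
# Assumptions (not from the released code): an ONNX segmentation model exported from
# pytorch, onnxruntime instead of TensorRT, and placeholder label grouping / colors.
import numpy as np
import onnxruntime as ort
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

# Hypothetical: condense the network's full label set (e.g. 150 ADE20K classes)
# down to <= 20 groups, then give each group a unique color for Kimera-Semantics.
LABEL_TO_GROUP = np.zeros(150, dtype=np.uint8)  # fill in your actual grouping
GROUP_COLORS = np.random.RandomState(0).randint(0, 256, (21, 3)).astype(np.uint8)


class SemanticSegmentationNode:
    def __init__(self):
        self.bridge = CvBridge()
        self.session = ort.InferenceSession(rospy.get_param("~model_path"))
        self.input_name = self.session.get_inputs()[0].name
        self.pub = rospy.Publisher("semantics/rectified", Image, queue_size=1)
        self.sub = rospy.Subscriber("rgb/image_raw", Image, self.callback, queue_size=1)

    def callback(self, msg):
        rgb = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")
        # NCHW float input; swap in whatever resizing/normalization your network expects.
        blob = rgb.astype(np.float32).transpose(2, 0, 1)[None] / 255.0
        logits = self.session.run(None, {self.input_name: blob})[0]  # 1 x C x H x W
        labels = np.argmax(logits[0], axis=0).astype(np.uint8)       # H x W full label image
        groups = LABEL_TO_GROUP[labels]                              # condensed to <= 20 groups
        colored = GROUP_COLORS[groups]                               # H x W x 3 semantic colors
        out = self.bridge.cv2_to_imgmsg(colored, encoding="rgb8")
        out.header = msg.header  # keep the timestamp/frame so depth registration lines up
        self.pub.publish(out)


if __name__ == "__main__":
    rospy.init_node("semantic_segmentation_node")
    SemanticSegmentationNode()
    rospy.spin()

The colors produced this way need to match the rows of the CSV that you pass to Kimera-Semantics.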

Running with a different pose source

Our plan is to add instructions for configuring Hydra to work with new datasets and sensors, but it may be a little while before I have a chance to write them up, and they will be specific to Kimera-VIO when I do. That said, it's certainly possible to set up a launch file for a different sensor and odometry source. The base launch files that we share between our different datasets are here and here. You can look at how we specialize our launch files here and here respectively.

The two inputs that are required for Hydra to function are a semantic pointcloud (by default, Hydra subscribes to /semantic_pointcloud) and a TF frame with the current pose of the sensor producing the pointcloud, which is specified by the argument here. Enabling our full loop closure detection pipeline is a little trickier. You'll want to make sure the arg use_gt_frame is set to true so that this node is actually enabled (which will save you from having to publish the robot pose graph yourself). Then you have to publish BoW vectors with this message to /kimera_vio_ros/bow_vectors. To run with LCD enabled, you just have to run with enable_dsg_lcd:=true. If you don't have a source for visual descriptors, the scene graph LCD should still work, though we've found that the performance can be a little hit-or-miss when only using the scene graph for LCD instead of also using the vision-based descriptors.
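
If your odometry source publishes nav_msgs/Odometry (as mentioned in the original question) instead of TF, a minimal sketch of bridging it to the TF frame Hydra looks up might look like this (not from the Hydra codebase; the topic and frame names are placeholders, and it assumes the odometry already gives the pose of the pointcloud sensor, otherwise compose with your camera extrinsics first):

#!/usr/bin/env python3
# Sketch: bridge an external nav_msgs/Odometry source to the TF frame Hydra expects.
import rospy
import tf2_ros
from geometry_msgs.msg import TransformStamped
from nav_msgs.msg import Odometry


def odom_callback(msg):
    t = TransformStamped()
    t.header.stamp = msg.header.stamp
    t.header.frame_id = "world"           # Hydra's world frame (placeholder name)
    t.child_frame_id = "camera_link"      # frame you pass as Hydra's sensor frame
    t.transform.translation.x = msg.pose.pose.position.x
    t.transform.translation.y = msg.pose.pose.position.y
    t.transform.translation.z = msg.pose.pose.position.z
    t.transform.rotation = msg.pose.pose.orientation
    broadcaster.sendTransform(t)


if __name__ == "__main__":
    rospy.init_node("odom_to_tf")
    broadcaster = tf2_ros.TransformBroadcaster()
    rospy.Subscriber("odom", Odometry, odom_callback, queue_size=10)
    rospy.spin()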

Finally, if you end up working on standing up your own pipeline, I'd kindly request that, if possible, you take notes on the process (that we can add as documentation) to help others also use Hydra (your experience may be more informative to other users than the instructions that I write up). Please also don't hesitate to reach out with other questions! I'll leave this issue open for now as temporary documentation, as I'm sure others have similar questions.

Hi @feramozur, have you been successful in using Hydra on data from your robot?

Do you think the ZED 2i camera is a suitable sensor setup to try Hydra for real-time use?

Hi, yes the ZED 2i is likely suitable for use as an input to Hydra (and it's on our list of cameras to try out). We've used similar cameras in the past for Hydra (the Kinect Azure for the RSS paper and the realsense D455 in some of our testing) though the underlying stereo technologies will probably affect the reconstruction quality in different environments (i.e. the time-of-flight sensor in the kinect vs. the active/passive D455 combo vs. the passive zed with a larger baseline).

If you're trying to run with Kimera-VIO, then the zed or the d455 are reasonable options: the zed's reconstruction quality won't suffer but it isn't global shutter, and the d455 is global shutter on the infrared cameras (I believe) but requires turning off the IR emitter, which can reduce the depth quality.

Thanks for your reply! I'm trying to run the ZED 2i camera with Kimera-VIO, but it doesn't work as expected. The mesh coming out of it appears to be chopped and broken, very far from the mesh I'm getting from the EuRoC dataset. I generated the config files for the camera using cam_info_yamlized.launch. Below is my kimera_vio_ros_zed2i.launch file itself.

<launch>
  <arg name="dataset_name" value="ZED2i"/>
  <arg name="online" default="True" />

  <arg name="log_output" default="false"/>
  <arg name="log_output_path"
       default="$(find kimera_vio_ros)/output_logs/"
       if="$(arg log_output)"/>
  <arg name="log_gt_data" default="false"/>
  <arg name="gt_topic" default=""/>
  <!-- Set use_sim_time to true if you use rosbag with clock argument -->
  <arg name="should_use_sim_time" default="false"/>

  <!-- Only used when parsing a rosbag -->
  <arg name="rosbag_path" default="default.bag"
       unless="$(arg online)"/>"/>

  <arg name="visualize" default="true"/>

  <!-- Frame IDs. These DO NOT match the frame ids on the video streams, as the
       camera driver and Kimera publish conflicting TFs -->

  <arg name="world_frame_id"     value="world"/>
  <arg name="base_link_frame_id" value="base_link1"/>
  <arg name="map_frame_id"       value="map1"/>
  <arg name="left_cam_frame_id"  value="left_cam"/>
  <arg name="right_cam_frame_id" value="right_cam"/>


  <arg name="run_stereo_dense"     default="false"/>

  <!-- Subscribed Topics -->
  <arg name="left_cam_topic"  value="/zed2i/zed_node/left/image_rect_gray"/>
  <arg name="right_cam_topic" value="/zed2i/zed_node/right/image_rect_gray"/>
  <arg name="imu_topic"       value="/zed2i/zed_node/imu/data"/>

  <!-- Launch actual pipeline -->
  <include file="$(find kimera_vio_ros)/launch/kimera_vio_ros.launch"
           pass_all_args="true"/>

 <!-- Launch static TF node from world to map -->
 <node pkg="tf" type="static_transform_publisher" name="world_to_map_broadcaster"
   args="0 0 0 0 0 0 1 world map 100"/>

</launch>

The mesh coming in RVIZ looks like this:

[Screenshot from 2022-07-29 12-19-58: the chopped, broken mesh in RViz]

Can you please point me in the right direction? I've also had issues with ORB-SLAM3, and one opinion there is that the ZED camera is a rolling shutter camera, while the algorithm requires a global shutter. Could this be an issue? Or is it just poor calibration of the camera? Many thanks in advance, any hint would be highly appreciated!

I'm unfortunately not a maintainer for Kimera VIO so I probably won't be able to help too much, but will say that the mesher output that you're seeing isn't really related to the quality of the mesh that Kimera Semantics produces, which is what Hydra actually uses for reconstruction (i.e. getting the mesher in Kimera VIO to work is irrelevant for Hydra). As long as you get a reasonable pointcloud out of the zed camera and a reasonable pose estimate from Kimera VIO, you should be set for running with Hydra (after setting up a new config and launch files for it).

w.r.t. rolling shutter / calibration: Rolling shutter effects can affect pose estimation quality, but they are most severe under high velocity or fast rotations, and I doubt they are causing catastrophic issues here. There could be issues with the intrinsics calibration for the zed if you did it yourself, as the left/right images might already be undistorted to do passive stereo (and calibrating an already undistorted image can lead to problems). However, having never directly worked with the camera myself, I'm not sure what the factory calibration looks like / whether the left/right images are already undistorted/rectified.

Thanks! Do you know how I can configure Kimera-Semantics to take the pose and point cloud directly from the ZED ROS wrapper and avoid using Kimera-VIO at all? An example launch file would be super useful!

I put together an untested sample launch file that tries to configure both the hydra_topology and hydra_dsg_builder base launch files with as few arguments as possible here. If I get time this week I'll try to replace the arguments with the corresponding uhumans2 information as a sanity check that I didn't miss anything. You may notice that I pointed Hydra to use the uhumans2 configuration parameters in the launch file: these are also a good set of defaults for a new dataset.

You will still need to set up a 2D semantic segmentation network external to the launch file, which is discussed in the first two sections of my response earlier in the issue here. It might also be worth looking at the pointers in that response to how the files for hydra_topology and hydra_dsg_builder get specialized for uHumans2 (which is maybe now a little more useful since I fixed the link to the uhumans2 hydra_topology launch file 😬).

Hi, thanks a lot for your reply! It helped a lot and I managed to launch semantic mesh building, and it works pretty well. I still have to figure out loop closure, so I have a couple of questions:

  1. Will the pipeline for enabling LCD still work as you described here, given that I'm using the ZED SDK (which runs its own loop closure algorithm) and I just specify the frame published by it in the Hydra ROS launch file?

  2. I've noticed that there is no mask for humans in my mesh - humans are meshed. They are correctly colored as humans and their label is added to dynamic_semantic_labels. Should I enable use_dynamic_mask also?

Below I attach my launch file and CSV file with the colors remapped (is this the way it's supposed to be?)

<launch>
    <arg name="sim_time_required" default="false"/>
    <param name="use_sim_time" value="$(arg sim_time_required)"/>

    <!-- pointcloud for hydra: no need to change -->
    <arg name="pointcloud_topic" value="/semantic_pointcloud"/>

    <!-- Replace these with your actual sensor information -->
    <arg name="sensor_frame" default="zed2i_left_camera_optical_frame"/>
    <arg name="depth_topic" default="/zed2i/zed_node/depth/depth_registered"/>
    <!-- These should point to a 2D semantic segmentation image -->
    <arg name="semantic_info_topic" default="semantics/camera_info"/>
    <arg name="semantic_topic" default="semantics/rectified"/>

    <!-- semantic configuration -->
    <arg name="semantic_map_path" default="$(find kimera_semantics_ros)/cfg/ade150_colour_mapping.csv"/>
    <arg name="typology_dir" default="$(find hydra_dsg_builder)/config/d455"/>
    <arg name="typology_config" default="d455_typology.yaml"/>

    <!-- see uhumans2 launch file for how these are used -->
    <arg name="dsg_output_dir" default="$(find hydra_dsg_builder)/output/uhumans2"/>
    <arg name="dsg_output_prefix" default="office"/>

    <!-- good starter rviz file, though topics could be edited for images -->
    <arg name="rviz_dir" default="$(find hydra_dsg_builder)/rviz"/>
    <arg name="rviz_file" default="uhumans2.rviz"/>

    <!-- turns rgb (or 2D semantics) + depth into pointcloud -->
    <include file="$(find hydra_utils)/launch/includes/rgbd_to_pointcloud.xml">
        <arg name="rgb_info_topic" value="$(arg semantic_info_topic)"/>
        <arg name="rgb_topic" value="$(arg semantic_topic)"/>
        <arg name="depth_topic" value="$(arg depth_topic)"/>
        <arg name="pointcloud_topic" value="$(arg pointcloud_topic)"/>
    </include>

    <!-- performs reconstruction and extracts places -->
    <include file="$(find hydra_topology)/launch/hydra_topology.launch" pass_all_args="true">
        <arg name="semantic_color_path" value="$(arg semantic_map_path)"/>
        <arg name="config" value="d455_topology_config.yaml"/>
        <arg name="semantic_config" default="ade150_semantic_config.yaml"/>
        <arg name="config_dir" value="$(find hydra_topology)/config"/>
        <arg name="debug" value="false"/>
    </include>

    <!-- constructs rest of scene graph from reconstruction and places -->
    <include file="$(find hydra_dsg_builder)/launch/dsg_builder.launch" pass_all_args="true">
        <arg name="robot_id" value="0"/>
        <arg name="use_gt_frame" value="true"/>
        <arg name="use_oriented_bounding_boxes" value="true"/>
        <arg name="config_dir" value="$(find hydra_dsg_builder)/config/d455"/>
    </include>

</launch>

Will the pipeline for enabling LCD still work as you described here, given that I'm using the ZED SDK (which runs its own loop closure algorithm) and I just specify the frame published by it in the Hydra ROS launch file?

No, sorry (when I wrote that response I forgot that Hydra depends on some unreleased kimera-vio code to actually compute the relative pose between the two keyframes for the vision-based loop closures). However, if you have information from the ZED as to when a loop-closure was detected (and what the relative pose is), then you could publish that as a pose_graph_tools::PoseGraph message (along with the odometry coming from the zed) and avoid using the pose_graph_publisher_node. Hydra should respect the external loop-closures and optimize the resulting pose graph / scene graph / mesh together.
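
For reference, here is a rough, untested sketch of such a publisher. The pose_graph_tools message and field names are written from memory, so double-check them against your checkout; the odometry topic and keyframe policy are placeholders.

#!/usr/bin/env python3
# Rough sketch of an external pose graph publisher (stand-in for pose_graph_publisher_node).
# Assumptions: odometry arrives as nav_msgs/Odometry already expressed as the sensor pose,
# and every message is treated as a keyframe (you would normally subsample).
import numpy as np
import rospy
import tf.transformations as tft
from geometry_msgs.msg import Pose
from nav_msgs.msg import Odometry
from pose_graph_tools.msg import PoseGraph, PoseGraphEdge, PoseGraphNode


def pose_to_matrix(p):
    T = tft.quaternion_matrix([p.orientation.x, p.orientation.y,
                               p.orientation.z, p.orientation.w])
    T[:3, 3] = [p.position.x, p.position.y, p.position.z]
    return T


def matrix_to_pose(T):
    p = Pose()
    p.position.x, p.position.y, p.position.z = T[:3, 3]
    q = tft.quaternion_from_matrix(T)
    p.orientation.x, p.orientation.y, p.orientation.z, p.orientation.w = q
    return p


class ExternalPoseGraphPublisher:
    def __init__(self):
        self.pub = rospy.Publisher("/incremental_dsg_builder_node/pose_graph_incremental",
                                   PoseGraph, queue_size=10)
        self.sub = rospy.Subscriber("odom", Odometry, self.callback, queue_size=10)
        self.prev_node = None
        self.key = 0

    def callback(self, msg):
        node = PoseGraphNode(header=msg.header, robot_id=0, key=self.key, pose=msg.pose.pose)
        if self.prev_node is not None:
            edge = PoseGraphEdge()
            edge.header = msg.header
            edge.robot_from = edge.robot_to = 0
            edge.key_from = self.prev_node.key
            edge.key_to = node.key
            edge.type = PoseGraphEdge.ODOM
            # Odometric edge: relative pose of the new keyframe w.r.t. the previous one.
            T_rel = np.linalg.inv(pose_to_matrix(self.prev_node.pose)).dot(
                pose_to_matrix(node.pose))
            edge.pose = matrix_to_pose(T_rel)
            graph = PoseGraph(header=msg.header, nodes=[self.prev_node, node], edges=[edge])
            # If your SLAM system reports a loop closure between keyframes i and j, append a
            # second edge with type=PoseGraphEdge.LOOPCLOSE, key_from=i, key_to=j, and pose
            # set to the relative pose it computed.
            self.pub.publish(graph)
        self.prev_node = node
        self.key += 1


if __name__ == "__main__":
    rospy.init_node("external_pose_graph_publisher")
    ExternalPoseGraphPublisher()
    rospy.spin()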

I've noticed that there is no mask for humans in my mesh - humans are meshed. They are correctly colored as humans and their label is added to dynamic_semantic_labels. Should I enable use_dynamic_mask also?

No, you shouldn't have to do this (that's only an argument for the kimera_semantics_rosbag.launch file and wouldn't have any effect on whether or not the dynamic semantic labels are used). I've confirmed that the dynamic masking should be working for uhumans2 (see the end of #2), so I'd suspect something isn't agreeing between the semantic colors and the actual labels. You can set this param to true in your frontend config to have Hydra publish the object pointclouds, and you'll also want to set this to all possible semantic labels (i.e. 0-20). You can then see what integer label each "semantic color" is being mapped to by subscribing to /incremental_dsg_builder_node/object_mesh_vertices/X, where X is the label you want to look at.
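
If it helps, here is a tiny, hypothetical monitor node for that; it assumes the per-label topics are sensor_msgs/PointCloud2, so confirm the type with rostopic info first:

#!/usr/bin/env python3
# Sketch: watch one of the per-label object vertex topics to see which label a color maps to.
import rospy
from sensor_msgs.msg import PointCloud2


def callback(msg, label):
    rospy.loginfo("label %d currently has %d vertices", label, msg.width * msg.height)


if __name__ == "__main__":
    rospy.init_node("object_vertex_monitor")
    label = rospy.get_param("~label", 0)
    topic = "/incremental_dsg_builder_node/object_mesh_vertices/{}".format(label)
    rospy.Subscriber(topic, PointCloud2, callback, callback_args=label, queue_size=1)
    rospy.spin()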

Many thanks for your reply, the problem with humans got solved! I'm still unsure about what exactly you meant by:

However, if you have information from the ZED as to when a loop-closure was detected (and what the relative pose is), then you could publish that as a pose_graph_tools::PoseGraph message (along with the odometry coming from the zed) and avoid using the pose_graph_publisher_node.

So would I have to rewrite the pose_graph_publisher_node to construct the pose_graph based on the odometry coming from ZED? I'd greatly appreciate it if you could expand on this further. Also, would it be easier to do LCD in Kimera-PGMO instead of ZED SDK? In this case, I'd only have odometry and depth images coming out of ZED. How can I enable LCD in this case if I provide the BoW? Or is it something that kimera-vio (unreleased) is supposed to do?

I'm glad you were able to resolve the issue with dynamic labels! If you get a chance, adding a brief statement about what was wrong may help other people who are having similar issues (editing your comment above is probably the best way to do this)

Also, would it be easier to do LCD in Kimera-PGMO instead of ZED SDK?

No, Kimera-PGMO doesn't generate loop closure proposals. Hydra and Kimera-PGMO share the same machinery (DsgBackend in Hydra inherits from KimeraPgmoInterface) for processing visual loop closures from an external VIO / SLAM pipeline (like Kimera-VIO).

So would I have to rewrite the pose_graph_publisher_node to construct the pose_graph based on the odometry coming from ZED? I'd greatly appreciate it if you could expand on this further.

My apologies, I made a bad assumption about the zed (that it gave you access to information about the loop closures it detected), which definitely made my previous instructions confusing.

More documentation as to how Hydra handles loop closures: The backend (and the frontend) subscribe to /incremental_dsg_builder_node/pose_graph_incremental, which is published either by pose_graph_publisher_node or by kimera-vio (though the changes for this are not released). Each message on this topic contains two node pose estimates (the most recent and second most recent keyframe poses) and at least one edge (an odometry edge with the relative pose between the most recent and second most recent keyframes).

pose_graph_publisher_node is designed to simulate the incremental messages that you would get from kimera by periodically grabbing the current pose of a tf frame, computing the relative pose between that and the last cached pose, and then publishing a pose graph message (with the two nodes and an odometric edge from the relative pose). This "odometric" pose graph (i.e. where all the edges are odometric between consecutive keyframes) is the minimum information necessary to get any loop closure detection working with Hydra, because the scene graph-based loop closure detection (if enabled) constructs descriptors for the objects and places around each keyframe (more on that later). The odometric pose graph is also added to the backend factor graph.

If a pose graph message contains additional edges of type LOOPCLOSE (instead of ODOM) between two non-consecutive keyframes, then Hydra adds a factor to the factor graph representing the loop closure (and then optimizes and corrects the scene graph). However, pose_graph_publisher_node will never publish loop closures (it doesn't have any information to detect them / compute the relative pose between the keyframes). Any loop closure received this way is an "external" loop closure.

Hydra can also detect "internal" loop closures. An odometric pose graph is enough for the scene graph-based loop closures, but the object-based registration (solving for the relative pose between two keyframes) can be poorly conditioned if you don't have enough objects in the scene or the semantic segmentation is noisy. Hydra's loop closure detector proceeds down the descriptor hierarchy (places, then objects, then BoW; see the paper for exact details). If BoW descriptors are not present, Hydra will skip BoW detection and visual registration. To enable BoW-based detection, you'd have to publish the BoW descriptor for each keyframe via the pose_graph_tools::BowQuery message to the topic /kimera_vio_ros/bow_query (there's a hardcoded remap in the base hydra launch file). To enable visual registration, you would have to write a service server that handles the service request made inside this function. This service takes the ids of two previous keyframes and returns the relative pose between them (if it was possible to compute). The code that handles this in kimera-vio is also not released yet.
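
As a very rough, untested sketch (the BowQuery/BowVector field names are written from memory, so please verify them against your pose_graph_tools checkout), publishing the descriptor for a keyframe might look like this; how you actually compute the BoW vector (e.g. DBoW2 over ORB features) is up to you and omitted here:

#!/usr/bin/env python3
# Sketch: publish a BoW descriptor for a keyframe so Hydra's LCD can use it.
import rospy
from pose_graph_tools.msg import BowQuery


def publish_bow(pub, robot_id, pose_id, word_ids, word_values):
    msg = BowQuery()
    msg.robot_id = robot_id
    msg.pose_id = pose_id  # must match the keyframe key used in the pose graph
    msg.bow_vector.word_ids = word_ids
    msg.bow_vector.word_values = word_values
    pub.publish(msg)


if __name__ == "__main__":
    rospy.init_node("bow_publisher")
    pub = rospy.Publisher("/kimera_vio_ros/bow_query", BowQuery, queue_size=10)
    # ... call publish_bow(pub, 0, key, ids, values) once per keyframe ...
    rospy.spin()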

Note that Hydra will always respect external loop closures, even if internal loop closure detection is enabled. We have switches in kimera to disable detecting loop closures so we don't get duplicate visual loop closures when running with internal loop closures.

All that said, there are four ways to proceed here:

  • Enable scene graph loop closures: This requires no development on your part, but it is the least tested of the options (we haven't had much of a reason to run the scene graph loop closure pipeline without BoW descriptors). As long as you have pose_graph_publisher_node running (which use_gt_frame:=true will enable), all you have to do is set enable_dsg_lcd:=true and internal loop closure detection and registration will be enabled. You may also want to set min_glog_level:=0 and verbosity:=2 to see the current status of the internal loop closures.

  • Internal/External vision-based loop closures: This requires a fair amount of development. You'd have to augment the pose graph publisher node to subscribe to rgb images (and optionally depth to help with registration) and select some subset of them as keyframes (generally we enforce that keyframes occur 0.2 seconds apart). For each keyframe, you'd compute the BoW descriptor and get the current pose at the keyframe's timestamp (instead of periodically looking up the current pose like the pose graph publisher node does now). If you want to provide external loop closures, you'd look through all the previous keyframes to see if any of the BoW vectors match; if they do, you'd solve for the relative pose using feature matching / PnP (or another registration method; see the registration sketch after this list) and publish it as a loop closure edge in the current pose graph message. If you're trying to provide internal loop closures, you'd publish the BoW vectors, and then have a service server that tries to compute the relative pose between any two previous keyframes via feature matching / PnP (or another registration method).

  • Use another VIO/SLAM pose source: If there's another open-source VIO solution that works with the zed, it's probably pretty feasible to publish a pose graph message with the necessary information for external loop closures. Obviously, I can see why you'd prefer using the pose estimate from the zed directly (and I'm not aware of an out-of-the-box solution for the zed, but I haven't looked for one myself).

  • Wait for us to release our changes to kimera-vio: Still not sure what the timeline is on the next release for kimera, and this would also require a working kimera-vio for the zed.
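
For the second option, a purely illustrative registration sketch via feature matching and PnP is below (OpenCV-based, no claim that this matches what kimera-vio does internally; all names are placeholders, depth is assumed to be in meters, and there is no verification beyond RANSAC):

#!/usr/bin/env python3
# Sketch: relative pose between two keyframes via ORB matching + PnP (RANSAC).
import cv2
import numpy as np


def register_keyframes(rgb_query, rgb_match, depth_match, K):
    """Return T_query_match (4x4) or None; K is the 3x3 intrinsics of both keyframes."""
    orb = cv2.ORB_create(1000)
    kp_q, des_q = orb.detectAndCompute(rgb_query, None)
    kp_m, des_m = orb.detectAndCompute(rgb_match, None)
    if des_q is None or des_m is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_q, des_m)

    # Back-project matched features in the match keyframe using its depth image.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts3d, pts2d = [], []
    for m in matches:
        u, v = kp_m[m.trainIdx].pt
        z = depth_match[int(v), int(u)]
        if z <= 0 or np.isnan(z):
            continue
        pts3d.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
        pts2d.append(kp_q[m.queryIdx].pt)
    if len(pts3d) < 10:
        return None

    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        np.array(pts3d, dtype=np.float32), np.array(pts2d, dtype=np.float32),
        K.astype(np.float32), None)
    if not ok:
        return None
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.ravel()
    return T  # transforms points from the match keyframe's camera frame to the query's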

Hi @nathanhhughes, thanks for all the great information you provide in this issue. I was just wondering if there were any updates regarding the release of the modified kimera-vio so one could have access to loop closures in hydra? Thank you!

Hi @araujokth , sorry for the delay in responding. It's looking like we'll have the version of Kimera-VIO that we used pushed as a pre-release in the next couple weeks (we're trying to have it out before the IROS submission deadline).

Thanks so much for the reply @nathanhhughes! Really looking forward to that update, and good luck with the IROS submission!

@araujokth (and others) Thank you for your patience! We've added our changes to a pre-release version of Kimera and updated the documentation on the Hydra side here.

Fantastic, thanks for the update! I will give it a try!