ros2/rclpy

Subscribers failing to catch publishers in a non-trivial system

Closed this issue · 3 comments

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04, Foxy
  • Installation type:
    • Binaries
  • Version or commit hash:
    • Latest debs
  • DDS implementation:
    • Fast-RTPS frequently fails, RTI Connext is fine
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

  • Install examples-rclpy-minimal-subscriber, examples-rclpy-minimal-publisher
  • In first shell, ./launch_other_nodes.py - this launches 10 pub-sub pairs on different topic names
  • In second shell, ./foo.py - periodically creates a new and destroys the old publisher every 10 seconds
  • As publishers get destroyed and created (/foo_0, /foo_1, ...) try to tune into ros2 topic echo /foo_N

foo.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Publishes every second on whatever publishers are open
# Rclpy spins once per second
# Starts with one publisher - /foo_0
# Destroys/creates every <event_periodicity> seconds (/foo_N)
#
# Usage:
#    When running as it's own node, can always catch `ros2 topic echo /foo_N`
#    When running with `launch_other_nodes.py`, catches `/foo_0`, but rarely others
#    When skipping destruction, can always catch `/foo_N` for every N
#    Increasing the <event periodicity> doesn't make a difference, once lost, always lost
#
import sys

import rclpy
import rclpy.qos
import std_msgs.msg as std_msgs


def qos_profile_latched():
    return rclpy.qos.QoSProfile(
        history=rclpy.qos.QoSHistoryPolicy.RMW_QOS_POLICY_HISTORY_KEEP_LAST,
        depth=1,
        durability=rclpy.qos.QoSDurabilityPolicy.RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL,
        reliability=rclpy.qos.QoSReliabilityPolicy.RMW_QOS_POLICY_RELIABILITY_RELIABLE
    )


def shutdown():
    rclpy.shutdown()
    sys.exit(1)


def main():
    rclpy.init(args=None)
    node = rclpy.create_node(
        node_name="foo",
        start_parameter_services=False
    )
    executor = rclpy.executors.SingleThreadedExecutor()
    executor.add_node(node)
    counter = 0
    tick_counter = 0
    publishers = {}

    def publish_message(publisher):
        nonlocal counter
        msg = std_msgs.String(data="foo - {}".format(counter))
        node.get_logger().info("Publishing [{}]:{}".format(publisher.topic, msg))
        publisher.publish(msg)
        counter += 1

    event_periodicity = 10
    while rclpy.ok():
        # every <event_periodicity> seconds, create a new publisher, destroy the old one
        if (tick_counter % event_periodicity) == 0:
            number = int(tick_counter / event_periodicity)
            previous = "/foo_{}".format(number - 1)
            name = "/foo_{}".format(number)
            node.get_logger().info("Creating Publisher: {}".format(name))
            publishers[name] = node.create_publisher(
                msg_type=std_msgs.String,
                topic=name,
                qos_profile=qos_profile_latched()
            )
            if number != 0:
                node.get_logger().info("Destroying Publisher: {}".format(publishers[previous].topic))
                node.destroy_publisher(publishers[previous])
                del publishers[previous]
        # publish on whatever publishers are instantiated
        tick_counter += 1
        try:
            for unused_name, publisher in publishers.items():
                publish_message(publisher)
            executor.spin_once(timeout_sec=1)
            # rclpy.spin_once(node, timeout_sec=1)
        except KeyboardInterrupt:
            shutdown()


if __name__ == "__main__":
    main()

launch_other_nodes.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import launch
import launch_ros


def generate_launch_description():
    launch_nodes = []
    for i in range(10):
        launch_nodes.append(
            launch_ros.actions.Node(
                package='examples_rclpy_minimal_publisher',
                name="publisher_{}".format(i),
                executable="publisher_old_school",
                output="log",
                remappings=[
                    ('/topic', '/topic_{}'.format(i)),
                ],
                arguments=['--ros-args', '--log-level', 'WARN'],
            )
        )
        launch_nodes.append(
            launch_ros.actions.Node(
                package='examples_rclpy_minimal_subscriber',
                name="subscriber_{}".format(i),
                executable="subscriber_old_school",
                output="log",
                remappings=[
                    ('/topic', '/topic_{}'.format(i)),
                ],
                arguments=['--ros-args', '--log-level', 'WARN'],
            )
        )
    return launch.LaunchDescription(launch_nodes)


if __name__ == "__main__":
    ls = launch.LaunchService()
    ls.include_launch_description(generate_launch_description())
    ls.run()

Expected behavior

  • Tuning into ros2 topic echo /foo_N for whatever N is currently active prints foo - <tick_number> to screen

Actual behavior

  • Always catches /foo_0 but rarely manages to catch on other /foo_N topics, just hangs (which is different again to the almost immediate return to the command line if a publisher is not present).

Additional information

  • Period for publisher destruction, creation is at 10 seconds, but pushing this out further doesn't help (i.e. giving more time for discovery). Once lost, always lost.
  • Dropping publisher destruction - everything is fine
  • Running foo only without the other nodes present - everything is fine (haven't seen a failure yet)
  • Switching to RTI Connext - everything is fine
  • Symptoms are the same if programmatically creating subscribers instead of using ros2cli verbs
  • Symptoms are the same when calling node.count_publishers(topic_name), i.e. always returns zero
  • This seems to be a regression in foxy, these scripts work (once node_name, node_executable is refactored into the launch script) fine in eloquent.

Haven't drilled down further yet, just managed to isolate it from my own code into this reproduceable example. Let me know if there's anything I can do to help.

Hm, interesting. I just tried this out on both master (built from source) and Foxy, and I can't seem to make it fail with Fast-DDS. In my testing here, either I get:

ubuntu@ros2-focal:~$ ros2 topic echo /foo_5
WARNING: topic [/foo_5] does not appear to be published yet
Could not determine the type for the passed topic

(if I miss the topic), or I get the data that is supposed to be published to the topic:

ubuntu@ros2-focal:~$ ros2 topic echo /foo_13
data: foo - 132
---
data: foo - 133
---
data: foo - 134
---
data: foo - 135
---
data: foo - 136
---
data: foo - 137
---
data: foo - 138
---
data: foo - 139
---

I wonder what is different between our systems. @stonier are you up-to-date on the latest Foxy packages? Also, are you running inside of a container, or on bare-metal, or something else?

System: portable pc (heavy duty laptop, @clalancette you know the one :P)

Debs

They were up to date at the time. I just ran the tests prior to updating them again just to confirm I have the problem. Then updated all my ros-foxy-* dependencies. Problem, no longer apparent (at least given a few trials with these scripts and a couple of other launchers I have).

Possibly, we can close this since we can't reproduce any longer unless someone else has a breadcrumb. I wish I had taken a diff of the package versions - if the problem does reappear we could have done some sleuthing.

So, there have only been two versions of Fast-{RTPS,DDS} released into Foxy so far: 2.0.0 (what Foxy was originally released with), and 2.0.1 (what it is at currently). When running with 2.0.1 (which is what I tested with yesterday), I don't see the problem. When I downgraded to 2.0.0, I now see the problem.

So I'd say that this is probably fixed with the newer version of Fast-{RTPS,DDS}. I'm going to close this out, but feel free to reopen if you see the problem again. Thanks for the report.