A flaky test about QosRclcppTestFixture.test_best_available_policies_publisher

https://ci.ros2.org/job/ci_linux/18164/testReport/junit/test_quality_of_service/QosRclcppTestFixture__rmw_cyclonedds_cpp/test_best_available_policies_publisher/

Stacktrace
/home/jenkins-agent/workspace/ci_linux/ws/src/ros2/system_tests/test_quality_of_service/test/test_best_available.cpp:128
Expected equality of these values:
  publishers_info.size()
    Which is: 0
  1u
    Which is: 1

Try to analyze the reason below,

https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L110
which just makes sure there is a reader in the data_readers_ of graph

    rcl_wait_for_subscribers
        ...
	common_context->graph_cache.get_reader_count
            __get_count(data_readers_, ...)

and it can't promise that the data_writers_ will be updated immediately.
https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L126

publisher->get_publishers_info_by_topic
	Node::get_publishers_info_by_topic
		node_graph_->get_publishers_info_by_topic
			get_info_by_topic(..., rcl_get_publishers_info_by_topic)
				rcl_get_publishers_info_by_topic(..., rmw_get_publishers_info_by_topic)
					rmw_get_publishers_info_by_topic in the DDS
						common_context->graph_cache.get_writers_info_by_topic
						    __get_entities_info_by_topic(data_writers_, ...)

As the data_readers_ and data_writers_ are updated separately,

rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp

Lines 899 to 905 in 4bec4c8

    
           impl->common.graph_cache.add_entity( 
        
             gid, 
        
             std::string(s->topic_name), 
        
             std::string(s->type_name), 
        
             ppgid, 
        
             qos_profile, 
        
             is_reader);

If the data_writers_ is slowly updated, the flaky test happens.

Any suggestion or idea is welcomed.

diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index acfc33d..6e801d0 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -903,6 +903,10 @@ static void handle_builtintopic_endpoint(
         ppgid,
         qos_profile,
         is_reader);
+      // printf("handle_builtintopic_endpoint %d %s\n", is_reader, s->topic_name);
+      if (strcmp(s->topic_name, "rt/test_best_available_publisher") == 0) {
+        dds_sleepfor(DDS_MSECS(10));
+      }
     }
     dds_return_loan(reader, &raw, 1);
   }

The above test patch makes the test case fail easily on my machine.

If you want that test to work reliably, maybe calling handle_builtintopic_endpoint in rmw_get_publishers_info_by_topic before checking the graph cache would be an option? If there is no data available (the common case), it'd will take very little time. (Obviously there are a few analogous cases.)

It's either that, or using the introspection functions provided by DDS without relying on ROS' graph cache, like it used to do before the graph cache got introduced. That'd lead to a lot of extra code and, no doubt, new problems in the future. So that'd not be my first choice.

besides, I think adding some sleep would be also reasonable as implementation-agnostic test case? (there are many places to do that.)

I kinda hate being dependent on short sleeps in test cases. That said, we run Cyclone's CI on Azure pipelines with several tests running in parallel, and we seem to get much nastier variations in scheduling than what the ROS 2 CI seems to get.

If a short sleep is sufficient, then that is good enough for a test case, I'd say.

yeah, CI takes long enough time already. If it is stable w/o sleep, that is much better.

calling handle_builtintopic_endpoint in rmw_get_publishers_info_by_topic before checking the graph cache would be an option?

@eboasson, thanks for your suggestion, but I am afraid that it's no guarantee that the data_writers_ can be updated as well.

adding some sleep would be also reasonable

@fujitatomoya, thank you, this can fix the flaky issue(actually, I like doing that), but this kind of code snippet seems no good, users might not like it.

If Logic B can't depend on Logic A, can I just remove Logic A and use another way to make sure Logic B can be successful? What I mean is to remove the https://github.com/ros2/system_tests/blob/c21d78c8f3386a6081578303439c1c04d5433e29/test_quality_of_service/test/test_best_available.cpp#L107-L118, and then use a new utility function with a timeout to call publisher->get_publishers_info_by_topic for what we expect.

calling handle_builtintopic_endpoint in rmw_get_publishers_info_by_topic before checking the graph cache would be an option?
@eboasson, thanks for your suggestion, but I am afraid that it's no guarantee that the data_writers_ can be updated as well.

The way I was looking at is that the built-in topics (DCPSPublication, DCPSSubscription, DCPSParticipant) are published synchronously by Cyclone, and so the DCPSPublication sample describing a newly data writer in the same thread is visible in a reader for DCPSPublication by the time the dds_create_writer call returns. The graph cache update functions are also all synchronous.

Following that line of reasoning, the race must be between the discovery_thread thread and the main thread. You're right that simply processing whatever data is currently available in those built-in topic readers doesn't guarantee anything because it fails to account for whatever that discovery thread is currently doing. But that is, in itself, a manageable problem, it just means synchronising with that thread.

I suspect that's not worth it for a test, but it would be if application code is expected to make the same assumptions as the test. Or, alternatively, you could document there is no guarantee that rmw_get_publishers_info_by_topic (&c.) are always consistent with events that causally preceded it. Or something like that.

I'm just taking another look at this test, and I'm actually wondering if the bug is something slightly different.

In this test, we are instantiating a QosTestPublisher at line 100, and a QosTestSubscriber at line 103. Each of those contain an rclcpp::Publisher and rclcpp::Subscription, respectively, but crucially they do not instantiate them until toggle is called.

So at line 105, we call toggle on the subscriber, which actually instantiates the rclcpp::Subscription and causes the RMW/DDS machinery to run and do discovery. In line 110, we wait for the subscriber with a 5 second timeout, which is enough time in almost all cases for that discovery to have happened.

However, what we do next is suspicious. In line 120, we call publisher->toggle(), which is the first time that we are instantiating the underlying rclcpp::Publisher. And then we are immediately calling publisher->get_publishers_info_by_topic(), which doesn't wait around at all; it just returns what it has right now. So we have not given discovery any time to happen, and update the graph cache (which is what we are looking at here).

So with all of that said, I think the fix here is to have the test attempt to get_publishers_info_by_topic a few times before giving up. That will give discovery time to happen in almost all cases.

What does everybody think of that?

@clalancette

So with all of that said, I think the fix here is to have the test attempt to get_publishers_info_by_topic a few times before giving up. That will give discovery time to happen in almost all cases.

That works, and I be completely fine with that if it is considered acceptable that discovery can lag for events that occurred prior to the query in a causal ordering. (I'm hope I am formulating it sufficiently incomprehensibly to get away with the statement 😂)

It is just that I have found time and again that such asynchronous behaviour even when the creation and the check are on the same thread trips people up and results in flaky application behaviour. This holds even when it is properly documented, because nobody really reads the documentation anyway until they have figured out the problem.

On the other hand, I suspect hardly anyone will ever run into this outside of test cases. And so while it is possible to modify the discovery thread and get_publishers_info_by_topic (& friends) to guarantee you will see the publisher (in this example), it may not be worth the effort.

I'm cool with it either way, all I want to be sure of is that pros and cons are weighed. I kinda like the option of checking a few times in this test (like you propose) and making sure there is a note in the documentation of get_publishers_info_by_topic (& friends) that discovery may be asynchronous and the result need not reflect events immediately, including those caused by the calling thread.

That way, the documentation tries to warn anyone of the trap while leaving open the possibility of giving stronger guarantees in the future, should that turn out to be worth the effort.

Thanks for the response @eboasson .

That works, and I be completely fine with that if it is considered acceptable that discovery can lag for events that occurred prior to the query in a causal ordering. (I'm hope I am formulating it sufficiently incomprehensibly to get away with the statement joy)

I understand what you are saying (and I think this is a reiteration of what you said before, I just didn't understand it then :).

It is just that I have found time and again that such asynchronous behaviour even when the creation and the check are on the same thread trips people up and results in flaky application behaviour. This holds even when it is properly documented, because nobody really reads the documentation anyway until they have figured out the problem.

Yeah, that worries me too. But I would say that this particular issue has always potentially been there.

On the other hand, I suspect hardly anyone will ever run into this outside of test cases. And so while it is possible to modify the discovery thread and get_publishers_info_by_topic (& friends) to guarantee you will see the publisher (in this example), it may not be worth the effort.

We'll have to see. It might be worthwhile to do that, but there is probably a trade-off involved.

I'm cool with it either way, all I want to be sure of is that pros and cons are weighed. I kinda like the option of checking a few times in this test (like you propose) and making sure there is a note in the documentation of get_publishers_info_by_topic (& friends) that discovery may be asynchronous and the result need not reflect events immediately, including those caused by the calling thread.

Good call! I've gone ahead and added in a note to the RMW documentation in ros2/rmw#352

@clalancette @eboasson

cannot agree more.

fix here is to have the test attempt to get_publishers_info_by_topic a few times before giving up. That will give discovery time to happen in almost all cases.

@clalancette are you going to address this fix as well as doc update?

@clalancette are you going to address this fix as well as doc update?

I think @iuhilnehc-ynos already did this in ros2/system_tests#513 . Along with my documentation update in ros2/rmw#352 , I think that should cover everything we talked about here.

	impl->common.graph_cache.add_entity(
	gid,
	std::string(s->topic_name),
	std::string(s->type_name),
	ppgid,
	qos_profile,
	is_reader);