eclipse-cyclonedds/cyclonedds

Galactic CycloneDDS 0.8 check_whc Assertion failed crash

NamanMM opened this issue · 3 comments

Hello everyone,

I have been getting the following failure in a node's callback function, which results in a crash. It is sporadic and slightly tricky to reproduce, but here is the error message:

/tmp/binarydeb/ros-galactic-cyclonedds-0.8.0/src/core/ddsc/src/dds_whc.c:284: check_whc: Assertion `whc->maxseq_node->next_seq == NULL' failed.

Any pointers or suggestions will be appreciated.
Thank you in advance.

Hi @NamanMM, this doesn't immediately remind me of a fix that went in, and I also don't see any change that seems obviously related. I therefore have to assume the problem is still present in Cyclone, which makes it a "very serious bug" indeed.

To start with: is there anything interesting in the circumstances around the time of the crash? For example, creating new subscriptions or deleting them, publishing as fast as possible, or having a lifespan configured (from memory, I am not sure this QoS setting is even mapped in ROS 2). Is it a keep-last or a keep-all writer?
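For reference, a keep-last, volatile writer without a lifespan looks roughly like this through the plain Cyclone DDS C API (a made-up sketch, not code from your application; ROS 2 would set the equivalent QoS through its RMW layer rather than calling this directly):

#include "dds/dds.h"

/* Sketch: create a keep-last (depth 10), volatile writer with no lifespan.
 * The participant and topic handles are assumed to exist already. */
dds_entity_t make_keep_last_writer (dds_entity_t participant, dds_entity_t topic)
{
  dds_qos_t *qos = dds_create_qos ();
  dds_qset_history (qos, DDS_HISTORY_KEEP_LAST, 10);   /* DDS_HISTORY_KEEP_ALL would make it a keep-all writer */
  dds_qset_durability (qos, DDS_DURABILITY_VOLATILE);  /* nothing retained for late-joining readers */
  /* no dds_qset_lifespan call: samples do not expire on their own */
  dds_entity_t writer = dds_create_writer (participant, topic, qos, NULL);
  dds_delete_qos (qos);
  return writer;
}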

Secondly, this is in check_whc, which implements a bunch of sanity checks on the contents of the writer history cache. It would be interesting to know what operation triggered it. Any chance you could get a stack trace?
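If running the process under a debugger isn't practical, one low-effort alternative is to have the application dump a trace itself when the assert aborts it. A rough sketch using glibc's backtrace facilities (the handler below is something you would add to your own application; the name is made up, and the output is less informative than a proper gdb backtrace of a core dump):

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

/* Sketch: print a backtrace to stderr when assert() raises SIGABRT,
 * then re-raise the signal so the normal abort/core dump still happens. */
static void dump_backtrace_on_abort (int sig)
{
  void *frames[64];
  int n = backtrace (frames, 64);
  backtrace_symbols_fd (frames, n, STDERR_FILENO);
  signal (sig, SIG_DFL);
  raise (sig);
}

/* somewhere early in main(): */
/*   signal (SIGABRT, dump_backtrace_on_abort); */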

Hey @eboasson, thank you for such a quick response.

Just to give some more context: the crash happens in an image segmentation callback, which is invoked whenever an image segmentation message arrives on the topic (MaskRCNN detections on the image; the message contains ROS 2 types such as sensor_msgs/RegionOfInterest[], sensor_msgs/Image[], etc.).
In this callback we then use the MaskRCNN box together with the point cloud, etc. to publish a message containing a bunch of int and float values.
Most likely the failure happens just after entering this callback, because the first INFO message has already been printed (although that is not guaranteed, since the log messages are not necessarily printed in order).
It is a keep-last writer.

Since the error is sporadic and not reliably reproducible, I will try to gather more information the next time it crashes.

I hope this helps. Thank you.

Hi @NamanMM, just wanted to let you know I have been looking at it, but I haven't reproduced it yet. I do have a vague sense of unease on one point, and if you are building from source you could try to see whether the problem disappears with a small change to the code. I don't know whether that would be worth it, nor how long it would take to be confident that the change had an effect.

I suspect you end up in the noidx variant of the code for removing messages that have been acknowledged by all readers, because ROS 2 doesn't use keys, sensor data is typically "volatile" (so no history is retained for late-joining readers), and "lifespan" is rarely used. That is a slightly faster way of dropping the data than the full variant, but optimisations have a habit of introducing problems of their own. Changing

if (whc->wrinfo.idxdepth == 0 && !whc->wrinfo.has_deadline && !whc->wrinfo.is_transient_local)
  cnt = whc_default_remove_acked_messages_noidx (whc, max_drop_seq, deferred_free_list);
else
  cnt = whc_default_remove_acked_messages_full (whc, max_drop_seq, deferred_free_list);

to always call whc_default_remove_acked_messages_full might therefore fix it. At the very least it would provide evidence for or against my suspicion that the noidx version has a subtle problem.
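Concretely, the experiment would be to replace that branch with an unconditional call to the full variant, something along these lines (a local test change only, not a proposed fix):

/* Experiment: bypass the noidx fast path and always use the full variant,
 * to see whether the assertion in check_whc still fires. */
cnt = whc_default_remove_acked_messages_full (whc, max_drop_seq, deferred_free_list);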