eclipse-cyclonedds/cyclonedds-cxx

Process blocked when destroying dds_reader

Opened this issue · 7 comments

Environment

CycloneDDS version: 0.10.3
OS: Ubuntu 20.04.6

Bug description

While the program is running, the numbers of DDS entities are roughly in the following ratio:
domain_id : participant : topic : reader : writer = 1 : 1 : x : y : z

Recently, to save some overhead, we changed some readers from static creation to dynamic creation. Testing this change, we found that the program gets stuck when these readers are dynamically destroyed.

Analyzing with gdb, we found the following stack for the abnormal (blocked) thread:
[screenshot of the blocked thread's gdb backtrace]

Further debugging showed that the problem occurs in the gcreq_queue_thread part. When ddsi_delete_reader internally calls gcreq_reader, which in turn executes gcreq_enqueue_common, everything is normal:
https://github.com/eclipse-cyclonedds/cyclonedds/blob/0.10.3/src/core/ddsi/src/ddsi_endpoint.c#L1612
https://github.com/eclipse-cyclonedds/cyclonedds/blob/0.10.3/src/core/ddsi/src/q_gc.c#L290

However, inside gcreq_queue_thread, execution consistently reaches this point:
https://github.com/eclipse-cyclonedds/cyclonedds/blob/0.10.3/src/core/ddsi/src/q_gc.c#L173

Finally, it blocks at:
https://github.com/eclipse-cyclonedds/cyclonedds/blob/0.10.3/src/core/ddsi/src/q_gc.c#L156

I haven't investigated further yet; I hope I can get some help here.

I can provide more information if needed. I hope to get a reply as soon as possible.

@eboasson hey, can you help me take a look at this issue?

Hi @YeahhhhLi, can it be that you are deleting the reader from inside a listener?

The problem with that is that I chose to invoke the listeners synchronously from deep inside the code, where the event actually occurs, and that limits the number of things you can safely do inside a listener function. Quite similar to how you can only do a limited set of things inside a signal handler in Unix.

The upside is that this means you get very fast listeners and without any context switches. The downside is this ...

It is simply too easy to fall into the trap. In fact, I did so myself a few times, too ... 😳 Eventually I ended up with this. I don't think it'll be useful to you, but the technique would work.

In short, it would be better to make the current behaviour opt-in by generalising that bit of code I referenced and putting it in the core, but right now it is the only behaviour.

> can it be that you are deleting the reader from inside a listener?

Yes, but the data_reader I deleted in the listener is a different one.

As I mentioned above, we want to open and close data_readers dynamically to reduce performance overhead.

We created a dds::DataReader for topic1 and bound a dds::Listener to it to monitor which other topics have been received. For each such topic we create a Reader (a wrapper class that contains a dds::DataReader member) and immediately call Reader::Disable() after creation; only when the topic's data is really needed do we enable it. Since DataReader has no disable/enable-style interface, we implement this by constructing and destructing the DataReader, and that is where we hit the problem above. A rough sketch follows.
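Something like this hypothetical sketch (the Reader wrapper and the TopicInfo/SensorData types are illustrative, assumed IDL-generated; they are not from our actual code):

```cpp
#include <map>
#include <memory>
#include <string>

#include <dds/dds.hpp>

// Disable() releases the wrapped DataReader, destroying the underlying
// dds_reader; Enable() constructs a fresh one.
class Reader {
public:
  Reader(const dds::sub::Subscriber& sub, const dds::topic::Topic<SensorData>& topic)
    : sub_(sub), topic_(topic), reader_(dds::core::null) {}

  void Enable()  { reader_ = dds::sub::DataReader<SensorData>(sub_, topic_); }
  void Disable() { reader_ = dds::core::null; }  // destroys the dds_reader

private:
  dds::sub::Subscriber sub_;
  dds::topic::Topic<SensorData> topic_;
  dds::sub::DataReader<SensorData> reader_;
};

// Listener on the topic1 reader: it creates a Reader for each newly seen
// topic and immediately disables it -- i.e. a dds_reader is destroyed from
// inside the listener callback, which is where the process blocks.
class TopicMonitor : public dds::sub::NoOpDataReaderListener<TopicInfo> {
public:
  void on_data_available(dds::sub::DataReader<TopicInfo>& rd) override {
    for (const auto& sample : rd.take()) {
      if (!sample.info().valid())
        continue;
      const std::string name = sample.data().name();  // hypothetical accessor
      auto reader = std::make_unique<Reader>(sub_, topics_.at(name));
      reader->Disable();  // reader destruction runs inside the listener
      readers_[name] = std::move(reader);
    }
  }
private:
  dds::sub::Subscriber sub_ = dds::core::null;
  std::map<std::string, dds::topic::Topic<SensorData>> topics_;  // known topics
  std::map<std::string, std::unique_ptr<Reader>> readers_;
};
```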

> Quite similar to how you can only do a limited set of things inside a signal handler in Unix.

In my opinion, the restrictions on a listener should not be as severe as those on a signal handler.

As users, we very commonly rely on listeners for asynchronous operations, and we are likely to do many things inside on_data_available.

A Unix signal handler, by contrast, mostly does cleanup work, which is not a common scenario.

Before there is a better solution, we can meet our needs in other ways, but I still hope DDS itself will eventually handle this better, for example:

  1. As you just mentioned, provide something like a DataReaderAsyncListener for certain usage scenarios, so that we can handle as much of our business logic as possible inside the listener.
  2. Let DataReader support disable/enable, so that we can flexibly choose whether to monitor the corresponding data according to the actual scenario.

It would be even better if there were another solution. In the meantime, our workaround is sketched below.
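Roughly, the idea is this minimal sketch (the DeferredWork helper and all names are illustrative, not a CycloneDDS API): instead of destroying the reader inside the listener, the listener posts the destruction to an application thread and returns immediately.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Runs posted tasks on a dedicated application thread, outside any
// CycloneDDS-internal call context.
class DeferredWork {
public:
  DeferredWork() : worker_([this] { Run(); }) {}
  ~DeferredWork() {
    { std::lock_guard<std::mutex> g(m_); stop_ = true; }
    cv_.notify_one();
    worker_.join();
  }
  // Called from the listener: enqueue the destruction instead of doing it inline.
  void Post(std::function<void()> fn) {
    { std::lock_guard<std::mutex> g(m_); q_.push_back(std::move(fn)); }
    cv_.notify_one();
  }
private:
  void Run() {
    std::unique_lock<std::mutex> l(m_);
    for (;;) {
      cv_.wait(l, [this] { return stop_ || !q_.empty(); });
      if (q_.empty() && stop_) return;
      auto fn = std::move(q_.front());
      q_.pop_front();
      l.unlock(); fn(); l.lock();  // run the delete without holding the lock
    }
  }
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> q_;
  bool stop_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};

// In the listener, instead of r->Disable():
//   work.Post([r] { r->Disable(); });  // taking care that r outlives the task
```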

I am aware of that, which is why I believe it should be changed to invoke listeners on a separate thread by default. It doesn't really get you anything you can't do today because it'll be exactly the same behaviour as having a waitset and calling the functions yourself, but it is (or can be, at least in C++ with lambda functions) convenient to the programmer.
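For illustration, that waitset equivalent looks roughly like this minimal sketch (assuming a hypothetical IDL-generated SensorData type). Because the handler runs on an application thread, it may freely create and delete readers:

```cpp
#include <dds/dds.hpp>

void PollLoop(dds::sub::DataReader<SensorData>& reader)
{
  // Wake up whenever the reader has data in any state.
  dds::sub::cond::ReadCondition rc(reader, dds::sub::status::DataState::any());
  dds::core::cond::WaitSet ws;
  ws.attach_condition(rc);
  for (;;) {
    ws.wait(dds::core::Duration::infinite());  // block until data arrives
    for (const auto& sample : reader.take()) {
      if (sample.info().valid()) {
        // handle sample.data(); deleting other readers here is safe
      }
    }
  }
}
```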

Invoking the listeners on a separate thread immediately raises questions: whether to coalesce events, whether to use a single thread or a thread pool, and what scheduling parameters to use. But there are reasonable default answers to those questions. Given a few sensible choices, it is not really hard to do, just a bit of work.

But today it is optimised for people who want to do only limited work in a listener and want that call to be as quick and cheap as possible.
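That "limited work" style amounts to something like the following minimal sketch (again assuming a hypothetical SensorData type): the listener only moves the samples onto an application-owned queue, and everything heavier, including entity creation and deletion, happens on the consuming thread.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

#include <dds/dds.hpp>

class QueueingListener : public dds::sub::NoOpDataReaderListener<SensorData> {
public:
  // Fast path, called synchronously by CycloneDDS: just copy out and notify.
  void on_data_available(dds::sub::DataReader<SensorData>& rd) override {
    std::lock_guard<std::mutex> g(m_);
    for (const auto& s : rd.take())
      if (s.info().valid())
        pending_.push_back(s.data());
    cv_.notify_one();
  }
  // Consumer thread: block until data arrives, then return the batch.
  std::deque<SensorData> Drain() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return !pending_.empty(); });
    std::deque<SensorData> out;
    out.swap(pending_);
    return out;
  }
private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<SensorData> pending_;
};
```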

> Let DataReader support disable/enable, so that we can flexibly choose whether to monitor the corresponding data according to the actual scenario.

That would still be tricky: what you're really seeing is the interaction between threads during the matching/unmatching of readers and writers. That in itself would not be a problem if it didn't serialize listener invocations, but it does, because I foresaw very nasty race conditions in application code if a listener function could be called on multiple threads at the same time.