eclipse-iceoryx/iceoryx

Add flag to RouDi to disable killing connected processes

Closed this issue · 4 comments

Brief feature description

Add a flag to iox-roudi (-d, --do-not-kill) which would disable sending SIGTERM and SIGKILL during shutdown.

Detailed information

Currently, when iox-roudi is shutting down, it kills all processes known to it.
I've encountered a situation in which the connection with RouDi is only part of the process's lifetime, and I definitely don't want the process to be killed when RouDi shuts down.

From what I understand, it shouldn't be much work, since there is already a flag in the config that can be used to disable this. It should suffice to add a CLI flag and push it through.

@kozakusek can you share some details of your use case?

The experimental Node API might be interesting for you

This API does not use a static runtime. When the Node is destroyed (all endpoints need to be destroyed beforehand), it de-registers from RouDi and RouDi can be shut down. Later you can restart RouDi and create new nodes.
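
For illustration, here is a rough sketch of how the builder-style experimental API is used. The identifiers (the header path, NodeBuilder, the publisher()/create<T>() builders, expect()) are assumptions based on the experimental headers on master and may differ in your iceoryx version; the integration test linked further down is the authoritative reference.

```cpp
// Sketch only: names and signatures are assumed from the experimental headers
// (iox/posh/experimental/node.hpp) and may differ in your iceoryx version.
#include "iox/posh/experimental/node.hpp"

#include <cstdint>

int main()
{
    using namespace iox::posh::experimental;

    // Creating the node registers the process with RouDi; no static runtime is involved.
    auto node = NodeBuilder("my-app").create().expect("node creation");

    {
        // Endpoints are created from the node via builders as well.
        auto publisher = node.publisher({"example", "radar", "counter"})
                             .create<uint64_t>()
                             .expect("publisher creation");
        publisher->loan().and_then([](auto& sample) {
            *sample = 42U;
            sample.publish();
        });
    } // all endpoints are destroyed here ...

    // ... so destroying the node de-registers it from RouDi; RouDi can then be
    // shut down, restarted with a different config, and new nodes can connect.
    return 0;
}
```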

Currently, only Publisher, Subscriber, and the WaitSet are available with the experimental API. If you also need Server, Client, and Listener, it isn't too complicated to add them. You can either try to add them yourself, wait until someone else implements them (it is on my quite long todo list), or ask the new guys from ekxide to do it ;)

I have long-running processes (days+) that might only want to communicate during start-up (or some catch-up) and then basically go on with their lives. In the meantime I want to restart the daemon to, for example, change the config.

The other thing: I want to have different RouDi configs for different tests in one JUnit suite.

Those are not major issues; however, the flag would be a nice thing to have, especially since there is already a field in the config that allows this behaviour.

In this case the new experimental API is exactly what you need. You can shut down RouDi once all the nodes are destructed, restart RouDi with a different config, and create new nodes to connect to the new RouDi. While multiple RouDi instances can run in parallel, it is not yet possible to have nodes connected to multiple RouDis in the same process at the same time; this is only possible in tests. Here you can check how to use multiple RouDi instances in the tests:
https://github.com/eclipse-iceoryx/iceoryx/blob/master/iceoryx_posh/test/integrationtests/test_posh_experimental_node.cpp#L570

I don't like to expose this option in the default RouDi since it only partially does what you need and might result in unexpected behavior. For example, if you restarted RouDi and ran a process that wants to subscribe to a publisher started with the previous RouDi, this would not be possible, leaving users confused, and eventually bug tickets would appear on this repo.

Furthermore, since the long-running process would not destruct the static runtime, the shared memory would not be released, as there are still open file descriptors. The files in /dev/shm will be removed but the memory will not be reclaimed. You can test this with pkill -9 iox-roudi and deleting all iox_ files from /dev/shm while your process is running. The communication via iceoryx continues to work; one just cannot subscribe to or unsubscribe from existing connections, but allocating chunks and publishing still works. This means you would run out of memory sooner or later, since the runtimes keep the memory mapped.

If you do not know when all nodes will be destroyed, and therefore also not when to stop RouDi, there are multiple options:

  • monitoring app (a rough sketch follows after this list)
    • subscribe to the process introspection topic
    • when there are no other processes listed except RouDi and your monitoring app, destruct the node in the monitoring app
    • send a SIGTERM to the RouDi PID
  • add option to let RouDi shut down automatically once the last process unregisters
    • in #970 we had the idea to let applications start RouDi if it is not running, and RouDi would shut down once no registered processes are left
    • this would be the ideal solution and should also not take too much time to implement
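
To make the monitoring-app idea more concrete, here is a minimal sketch built on the classic runtime. It assumes the introspection names from iceoryx_posh/roudi/introspection_types.hpp (IntrospectionProcessService, ProcessIntrospectionFieldTopic and its m_processList/m_name members), that the monitor's runtime name is "monitor", and that RouDi's PID is supplied on the command line; whether RouDi itself shows up in the process list depends on the iceoryx version, so the shutdown condition may need adjusting.

```cpp
// Sketch only: introspection type and field names are assumptions taken from
// iceoryx_posh/roudi/introspection_types.hpp; verify them for your iceoryx version.
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/roudi/introspection_types.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

#include <chrono>
#include <cstdlib>
#include <signal.h>
#include <sys/types.h>
#include <thread>

int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        return EXIT_FAILURE; // usage: ./monitor <roudi-pid>
    }
    // RouDi's PID is assumed to be passed in from outside (e.g. via pgrep).
    auto roudiPid = static_cast<pid_t>(std::atoi(argv[1]));

    iox::runtime::PoshRuntime::initRuntime("monitor");

    // The process introspection topic lists the runtimes currently registered with RouDi.
    iox::popo::SubscriberOptions options;
    options.historyRequest = 1U; // also receive the last published process list
    iox::popo::Subscriber<iox::roudi::ProcessIntrospectionFieldTopic> subscriber{
        iox::roudi::IntrospectionProcessService, options};

    bool done{false};
    while (!done)
    {
        subscriber.take().and_then([&](const auto& sample) {
            uint64_t othersStillRegistered{0U};
            for (const auto& process : sample->m_processList)
            {
                // Count every registered runtime that is not this monitoring app itself.
                if (process.m_name != iox::RuntimeName_t("monitor"))
                {
                    ++othersStillRegistered;
                }
            }
            if (othersStillRegistered == 0U)
            {
                // Nothing but the monitor is left: ask RouDi to shut down gracefully.
                kill(roudiPid, SIGTERM);
                done = true;
            }
        });
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return EXIT_SUCCESS;
}
```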

Thanks for the explanation and bringing up the issues that might arise from this change.
I will try the Node API and if it doesn't work out I will just think of some local workarounds.