COVESA/dlt-daemon

DLT-Multinode: Gateway does not recognize reset of passive node

Closed this issue · 13 comments

Precondition:

  1. Configure a multinode setup like described here
  2. Passive and Gateway node are up and running, Gateway has connected to Passive Node, DLT logs of Passive Node can be received at client connected to GatewayNode

Action:

  • Reset Gateway Node without proper shutdown (i. e. do not stop dlt-daemon)

Expected behaviour:

  • Gateway Node recognizes unavailability of Passive node and starts reconnecting

Observed behavior:

  • Gateway Node does not recognize unavailability of Passive node and no reconnection attempt is started

Traces with Verbose=1 and LoggingLevel=7 are attached: dlt-log.zip

DLT Package Version: 2.18.9 STABLE, Package Revision: v2.18.9

Might be solved here: #537

I will test it...

Might be solved here: #537

I will test it...

This does not solve the issue.

Hello @marc-heinlein ,
thanks for your report.
I have one questions:
"Reset Gateway Node without proper shutdown (i. e. do not stop dlt-daemon)"
Do you mean reset of Gateway Node or Passive Node here?
I would assume on reset on Gateway Node, it would try reconnecting after reboot (as dlt-daemon will be starting again).

Hello @michael-methner, the issue is when resetting the passive node. When resetting the gateway node, everything works like expected.

Hi all,
As I remember, this is an expected behavior.
The connection between gateway and passive nodes are established at starting.
Gateway defines some retrials at beginning and timeout there to connect to passive.
After that, there won't be any effort to reconnect passive nodes.

You can try to trigger an establishment from gateway to passive via dlt-passive-node-ctrl.

Hmm, for me, it sounds strange, that the passive node should actively request at the gateway node to reconnect to passive node. If this is the forseen strategy for the multinode configuration, it should be built into the passive node, that it automatically (re-) requests the connection once the gateway node does not connect until a certain timeout (i. e. if the passive node restarts after a reset).

Related to dlt-passive-node-ctrl: How could I call this on passive node side? I can not see any argument, that allows me to specify on which IP address the gateway node is located. Anyhow, this approach sounds to me like "create an external supervisor, for 'gateway-to-passive' connections".

Overall, it would be more straight forward for the gateway node to just trigger reconnection once it detects, that the passive node is not there anymore (maybe a cyclic control message could be used for such an alive check).

Hi

dlt-passive-node-ctrl should be performed at gateway node not at passive.
you can check with more options by using "-h"

Regarding the automatically reconnect attempt from passive, from my point of view I think it is not reasonable.
The passive node here is considering gateway node as a dlt-client.
t means dlt-daemon won't ever perform reconnect to an dlt-client. Only in reverse direction, dlt-client tries to reconnect to dlt-daemon.

"create an external supervisor, for 'gateway-to-passive' connections". <== I would say yes for this point.

Why not just letting the gateway node perform a (re-)connect as long as it detects, that it can not communicate with the passive node? This should be done independent of the reason of being disconnected:

  • (un-)gracefull shutdown of the passive node
  • network outage
  • not yet connected during startup

Close due to being not a bug but specific DLT mechanism for gateway <-> passive node

Unfortunately this issue has been closed considering to be a desired behaviour, that the gateway might loose the connection to a passive node without triggering any reconnection attempt.

This somehow renders the DLT multinode approach for losely coupled nodes completely useless...

Adding some external supervisor on the gateway node, that triggers the gateway to reconnect is a bad joke.

Unfortunately this issue has been closed considering to be a desired behaviour, that the gateway might loose the connection to a passive node without triggering any reconnection attempt.

This somehow renders the DLT multinode approach for losely coupled nodes completely useless...

Adding some external supervisor on the gateway node, that triggers the gateway to reconnect is a bad joke.

Hi,

In general, all features are very welcome and DLT maintainers are also happy and willing to review and
support those ones.
We maybe continue this topic in Discussiosn since it is actually from analysis and discussion so far, the system is working as expected. So this issue is not an actual issue => it should be closed.
I believe if you already had your own improvement/ design for this use case, DLT maintainer team and community can together review and merge it, of course if it helps us ;)

Happy coding
Thanh

Hello all,
Due to what I test on my local, I wonder is there any misuderstanding or wrong test steps here.
My testing process:

Docker: ubuntu20
IPv4: 192.168.2.1
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet **192.168.2.1**  netmask 255.255.255.0  broadcast 192.168.2.255
        inet6 fe80::42:59ff:fe9d:91ee  prefixlen 64  scopeid 0x20<link>
        inet6 2001:db8::1  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::1  prefixlen 64  scopeid 0x20<link>
        ether 02:42:59:9d:91:ee  txqueuelen 0  (Ethernet)
        RX packets 183194  bytes 444315574 (444.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 139330  bytes 444062532 (444.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Sorry if there is static IPv6, I am testing IPv6 adaption for gateway.

dlt_gateway.conf 

[General]
; Time interval for reconnection to passive Node in second.
; Interval=1
[PassiveNode1]
IPaddress=192.168.2.1
Port=3495
EcuID=DOCK
Timeout=10

On Docker:

dlt-daemon -p 3495 -c ~/dlt.conf
dlt.conf: EcuID=DOCK
[35954.848345]~DLT~ 2413~NOTICE   ~Starting DLT Daemon; DLT Package Version: 2.18.10 STABLE, Package Revision: v2.18.10_12_gf74e548, build on Nov  6 2023 14:30:55
-SYSTEMD -SYSTEMD_WATCHDOG -TEST -SHM

[35954.848398]~DLT~ 2413~INFO     ~FIFO size: 65536
[35954.848404]~DLT~ 2413~WARNING  ~Unable to set send timeout Socket operation on non-socket.
[35954.848405]~DLT~ 2413~INFO     ~Activate connection type: 5
[35954.848413]~DLT~ 2413~INFO     ~dlt_daemon_socket_open: Socket created
[35954.848417]~DLT~ 2413~INFO     ~dlt_daemon_socket_open: Listening on ip 0.0.0.0 and port: 3495
[35954.848421]~DLT~ 2413~INFO     ~dlt_daemon_socket_open: Socket send queue size: 16384
[35954.848430]~DLT~ 2413~INFO     ~Activate connection type: 1
[35954.848449]~DLT~ 2413~INFO     ~Activate connection type: 9
[35954.848454]~DLT~ 2413~INFO     ~Cannot open configuration file: /tmp/dlt-runtime.cfg
[35954.848456]~DLT~ 2413~INFO     ~Ringbuffer configuration: 500000/10000000/500000
[35954.848566]~DLT~ 2413~NOTICE   ~Failed to open ECU Software version file.
[35954.848575]~DLT~ 2413~WARNING  ~Unable to set send timeout Socket operation on non-socket.
[35954.848576]~DLT~ 2413~INFO     ~Activate connection type: 6
[35954.848577]~DLT~ 2413~INFO     ~Switched to buffer state for socket connections.
[35954.848580]~DLT~ 2413~WARNING  ~dlt_daemon_applications_load: cannot open file /tmp/dlt-runtime-application.cfg: No such file or directory
[35977.690780]~DLT~ 2413~INFO     ~Activate connection type: 2
[35977.690825]~DLT~ 2413~INFO     ~Switched to send buffer state for socket connections.
[35977.690835]~DLT~ 2413~INFO     ~Switched to send direct state.
^C[35981.206197]~DLT~ 2413~NOTICE   ~Exiting DLT daemon due to signal: Interrupt
[35981.206322]~DLT~ 2413~NOTICE   ~Exiting DLT daemon... [2]
[35981.206519]~DLT~ 2413~NOTICE   ~Leaving DLT daemon
root@Station:/home/lum3hc/work/docker/dlt-daemon/build# dlt-daemon -p 3495 -c ~/dlt.conf 
[35984.769587]~DLT~ 2414~NOTICE   ~Starting DLT Daemon; DLT Package Version: 2.18.10 STABLE, Package Revision: v2.18.10_12_gf74e548, build on Nov  6 2023 14:30:55
-SYSTEMD -SYSTEMD_WATCHDOG -TEST -SHM

[35984.769641]~DLT~ 2414~INFO     ~FIFO size: 65536
[35984.769651]~DLT~ 2414~WARNING  ~Unable to set send timeout Socket operation on non-socket.
[35984.769655]~DLT~ 2414~INFO     ~Activate connection type: 5
[35984.769663]~DLT~ 2414~INFO     ~dlt_daemon_socket_open: Socket created
[35984.769667]~DLT~ 2414~INFO     ~dlt_daemon_socket_open: Listening on ip 0.0.0.0 and port: 3495
[35984.769671]~DLT~ 2414~INFO     ~dlt_daemon_socket_open: Socket send queue size: 16384
[35984.769676]~DLT~ 2414~INFO     ~Activate connection type: 1
[35984.769697]~DLT~ 2414~INFO     ~Activate connection type: 9
[35984.769703]~DLT~ 2414~INFO     ~Cannot open configuration file: /tmp/dlt-runtime.cfg
[35984.769705]~DLT~ 2414~INFO     ~Ringbuffer configuration: 500000/10000000/500000
[35984.769861]~DLT~ 2414~NOTICE   ~Failed to open ECU Software version file.
[35984.769873]~DLT~ 2414~WARNING  ~Unable to set send timeout Socket operation on non-socket.
[35984.769874]~DLT~ 2414~INFO     ~Activate connection type: 6
[35984.769875]~DLT~ 2414~INFO     ~Switched to buffer state for socket connections.
[35984.769879]~DLT~ 2414~WARNING  ~dlt_daemon_applications_load: cannot open file /tmp/dlt-runtime-application.cfg: No such file or directory
[35985.691865]~DLT~ 2414~INFO     ~Activate connection type: 2
[35985.691914]~DLT~ 2414~INFO     ~Switched to send buffer state for socket connections.
[35985.691927]~DLT~ 2414~INFO     ~Switched to send direct state.
[35992.951236]~DLT~ 2414~NOTICE   ~Send log-level to context: LOG:TEST [-1 -> 4] [-1 -> 0]
[35992.951256]~DLT~ 2414~NOTICE   ~Send log-level to context: LOG:TS1 [-1 -> 4] [-1 -> 0]
[35992.951269]~DLT~ 2414~NOTICE   ~Send log-level to context: LOG:TS2 [-1 -> 4] [-1 -> 0]
^C[35995.745765]~DLT~ 2414~NOTICE   ~Exiting DLT daemon due to signal: Interrupt

On Host:

dlt-daemon
dlt.conf: Gateway=1
[35977.690358]~DLT~154400~NOTICE   ~Starting DLT Daemon; DLT Package Version: 2.18.10 STABLE, Package Revision: v2.18.10_12_gf74e548, build on Nov  6 2023 14:30:41
-SYSTEMD -SYSTEMD_WATCHDOG -TEST -SHM

[35977.690404]~DLT~154400~INFO     ~FIFO size: 65536
[35977.690414]~DLT~154400~WARNING  ~Unable to set send timeout Socket operation on non-socket.
[35977.690426]~DLT~154400~INFO     ~Activate connection type: 5
[35977.690433]~DLT~154400~INFO     ~dlt_daemon_socket_open: Socket created
[35977.690438]~DLT~154400~INFO     ~dlt_daemon_socket_open: Listening on ip 0.0.0.0 and port: 3490
[35977.690440]~DLT~154400~INFO     ~dlt_daemon_socket_open: Socket send queue size: 16384
[35977.690446]~DLT~154400~INFO     ~Activate connection type: 1
[35977.690461]~DLT~154400~INFO     ~Activate connection type: 9
[35977.690465]~DLT~154400~INFO     ~Cannot open configuration file: /tmp/dlt-runtime.cfg
[35977.690467]~DLT~154400~INFO     ~Ringbuffer configuration: 500000/10000000/500000
[35977.690582]~DLT~154400~NOTICE   ~Failed to open ECU Software version file.
[35977.690594]~DLT~154400~WARNING  ~Unable to set send timeout Socket operation on non-socket.
[35977.690595]~DLT~154400~INFO     ~Activate connection type: 6
[35977.690608]~DLT~154400~WARNING  ~Entry does not exist in section: Interval
[35977.690609]~DLT~154400~WARNING  ~Using default for Interval.
[35977.690613]~DLT~154400~WARNING  ~Entry does not exist in section: Connect
[35977.690614]~DLT~154400~WARNING  ~Using default for Connect.
[35977.690615]~DLT~154400~WARNING  ~Entry does not exist in section: SendControl
[35977.690616]~DLT~154400~WARNING  ~Using default for SendControl.
[35977.690617]~DLT~154400~WARNING  ~Entry does not exist in section: SendPeriodicControl
[35977.690618]~DLT~154400~WARNING  ~Using default for SendPeriodicControl.
[35977.690619]~DLT~154400~WARNING  ~Entry does not exist in section: SendSerialHeader
[35977.690620]~DLT~154400~WARNING  ~Using default for SendSerialHeader.
[35977.690712]~DLT~154400~INFO     ~Activate connection type: 11
[35977.690720]~DLT~154400~WARNING  ~Unable to set send timeout Socket operation on non-socket.
[35977.690721]~DLT~154400~INFO     ~Activate connection type: 12
[35977.690722]~DLT~154400~INFO     ~Switched to buffer state for socket connections.
[35977.690725]~DLT~154400~WARNING  ~dlt_daemon_applications_load: cannot open file /tmp/dlt-runtime-application.cfg: No such file or directory
[35981.206456]~DLT~154400~WARNING  ~Connection to passive node lost
[35981.206482]~DLT~154400~WARNING  ~Try to reconnect.
[35981.206485]~DLT~154400~WARNING  ~Connection to passive node lost
[35981.206489]~DLT~154400~INFO     ~Deactivate connection type: 11
[35981.691052]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[35982.691360]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[35983.691324]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[35984.692103]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[35985.691778]~DLT~154400~INFO     ~Activate connection type: 11
[35989.840994]~DLT~154400~INFO     ~Activate connection type: 2
[35989.841032]~DLT~154400~INFO     ~Switched to send buffer state for socket connections.
[35989.841063]~DLT~154400~INFO     ~Switched to send direct state.
[35995.746021]~DLT~154400~WARNING  ~Connection to passive node lost
[35995.746044]~DLT~154400~WARNING  ~Try to reconnect.
[35995.746046]~DLT~154400~WARNING  ~Connection to passive node lost
[35995.746050]~DLT~154400~INFO     ~Deactivate connection type: 11
[35996.691440]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[35997.692559]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[35998.698086]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[35999.692771]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36000.694695]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36001.691924]~DLT~154400~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36002.695804]~DLT~154400~INFO     ~Activate connection type: 11

Result:

2023/11/06 14:32:12.308735  360164658 000 ECU1 DA1- DC1- control response N 1 [service(3842), ok, 02 00 00 00 00]
2023/11/06 14:32:12.308769  360089391 000 ECU1 DLTD INTM log info V 1 [Daemon launched. Starting to output traces...]
2023/11/06 14:32:12.308773  360089391 000 DOCK DA1- DC1- control response N 1 [service(3842), ok, 02 00 00 00 00]
2023/11/06 14:32:12.308777  360053489 000 DOCK DA1- DC1- control response N 1 [service(3842), ok, 01 00 00 00 00]
2023/11/06 14:32:12.308780  360053489 003 DOCK DLTD INTM log info V 1 [Client connection #8 closed. Total Clients : 0]
2023/11/06 14:32:12.308783  360054891 025 DOCK LOG- TEST log warn V 2 [25 Hello]
2023/11/06 14:32:12.308785  360059898 026 DOCK LOG- TEST log warn V 2 [26 Hello]
2023/11/06 14:32:12.308789  360064919 027 DOCK LOG- TEST log warn V 2 [27 Hello]
2023/11/06 14:32:12.308792  360070030 028 DOCK LOG- TEST log warn V 2 [28 Hello]
2023/11/06 14:32:12.308794  360075045 029 DOCK LOG- TEST log warn V 2 [29 Hello]
2023/11/06 14:32:12.308799  360080143 030 DOCK LOG- TEST log warn V 2 [30 Hello]
2023/11/06 14:32:12.308802  360085211 031 DOCK LOG- TEST log warn V 2 [31 Hello]
2023/11/06 14:32:12.308805  360089392 004 DOCK DLTD INTM log info V 1 [New client connection #8 established, Total Clients : 1]
2023/11/06 14:32:12.308807  360090227 032 DOCK LOG- TEST log warn V 2 [32 Hello]
2023/11/06 14:32:12.308809  360095242 033 DOCK LOG- TEST log warn V 2 [33 Hello]
2023/11/06 14:32:12.308812  360100249 034 DOCK LOG- TEST log warn V 2 [34 Hello]
2023/11/06 14:32:12.308814  360105258 035 DOCK LOG- TEST log warn V 2 [35 Hello]
2023/11/06 14:32:12.308817  360110261 036 DOCK LOG- TEST log warn V 2 [36 Hello]
2023/11/06 14:32:12.308819  360115272 037 DOCK LOG- TEST log warn V 2 [37 Hello]
2023/11/06 14:32:12.308821  360120278 038 DOCK LOG- TEST log warn V 2 [38 Hello]
2023/11/06 14:32:12.308825  360125286 039 DOCK LOG- TEST log warn V 2 [39 Hello]
2023/11/06 14:32:12.308827  360130294 040 DOCK LOG- TEST log warn V 2 [40 Hello]
2023/11/06 14:32:12.308829  360135313 041 DOCK LOG- TEST log warn V 2 [41 Hello]
2023/11/06 14:32:12.308832  360140319 042 DOCK LOG- TEST log warn V 2 [42 Hello]
2023/11/06 14:32:12.308834  360145331 043 DOCK LOG- TEST log warn V 2 [43 Hello]
2023/11/06 14:32:12.308836  360150336 044 DOCK LOG- TEST log warn V 2 [44 Hello]
2023/11/06 14:32:12.308839  360155349 045 DOCK LOG- TEST log warn V 2 [45 Hello]
2023/11/06 14:32:12.308841  360160351 046 DOCK LOG- TEST log warn V 2 [46 Hello]
2023/11/06 14:32:12.308843  360164658 001 ECU1 DLTD INTM log info V 1 [New client connection #9 established, Total Clients : 1]
2023/11/06 14:32:12.379276  360165359 047 DOCK LOG- TEST log warn V 2 [47 Hello]
2023/11/06 14:32:12.880149  360170370 048 DOCK LOG- TEST log warn V 2 [48 Hello]
2023/11/06 14:32:13.386817  360175436 049 DOCK LOG- TEST log warn V 2 [49 Hello]

Same for the reverse way. My conclusion is that no matter how to disable the Gateway, or the Passive node, it will recover.
The only case that I could observe is when the timeout, I mean Gateway will give up on retrying connection to passvie node, not the reverse way when to reset the Gateway!

[36630.540745]~DLT~154706~WARNING  ~Connection to passive node lost
[36630.540772]~DLT~154706~WARNING  ~Try to reconnect.
[36630.540775]~DLT~154706~WARNING  ~Connection to passive node lost
[36630.540776]~DLT~154706~INFO     ~Deactivate connection type: 11
[36631.006885]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36632.014512]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36633.014660]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36634.009557]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36635.012678]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36636.003903]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36636.591970]~DLT~154706~INFO     ~Activate connection type: 10
[36636.592256]~DLT~154706~INFO     ~Deactivate connection type: 10
[36636.592267]~DLT~154706~INFO     ~Switched to buffer state for socket connections.
[36637.013572]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36638.004540]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36638.194878]~DLT~154706~INFO     ~Activate connection type: 10
[36638.195058]~DLT~154706~INFO     ~Deactivate connection type: 10
[36638.195063]~DLT~154706~INFO     ~Switched to buffer state for socket connections.
[36639.004548]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36640.005946]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36641.016445]~DLT~154706~ERROR    ~dlt_client_connect: ERROR: failed to connect! Connection refused
[36641.016461]~DLT~154706~WARNING  ~Passive Node connection retry timed out. Give up.
[36661.969746]~DLT~154706~INFO     ~Activate connection type: 10
[36661.969972]~DLT~154706~INFO     ~Deactivate connection type: 10
[36661.969984]~DLT~154706~INFO     ~Switched to buffer state for socket connections.
[36670.876477]~DLT~154706~INFO     ~Activate connection type: 10
[36670.876758]~DLT~154706~INFO     ~Deactivate connection type: 10

This is the result so far I check, please kindly check again.
To summarize:

Gateway reset: still reconnect (new connect in fact)
Passive node reset: still reconnect, but only within the timeout interval
No retry: timeout, Gateway giveup.

Please notice that dlt is up to v2.18.10.
Using v.9 might not perform the expected behavior!