XKNX/xknx

KNX entities switch from available to unavailable randomly

ericbefr opened this issue · 18 comments

Hello,

First of all, thank you for all the great work on Home Assistant!

I have already read and re-read home-assistant/core#59170, but I still have the same issue. I tried disabling all my integrations and moving my installation from a NUC running HA in Docker to another NUC with the latest HAOS version, but the problem persists. I also tried disconnecting everything from my switch, replacing the switch with another one, and changing cables, but nothing helps: the KNX integration still disconnects randomly.

The only thing I notice is that pinging my KNX-IP interface from my computer via Ethernet gives better responses for a few minutes after rebooting all devices and/or disconnecting my Internet box from the switch. In normal operation, when pinging the KNX-IP interface, I lose one packet for every 10 to 30 packets sent.

I don’t understand what you mean by “disabling route-back”…

I have spent the equivalent of more than two full days investigating and I’m quite depressed… I generally don’t ask for anything on forums, but here I really need help, because my KNX installation and other integrations are quite important and this issue randomly triggers some automations (even the alarm in the middle of the night… no need to say that my children and my wife did not let me off lightly ☹) and makes the system unusable.

Thanks in advance,
Eric

Hi 👋!
This doesn't really sound like an issue of xknx.

In normal operation, when pinging the KNX-IP interface, I lose one packet for every 10 to 30 packets sent.

Have you tried changing the IP interface?
What kind of interface - and connection - is this?

Hi,

Thanks for your help!

I also thought that it was a network problem... but none of the tests I have done prove it. A few months ago I did not have any problems, and my network has not changed in recent months... I'm quite lost with this issue!

I also thought about the KNX-IP interface, but when I connect my laptop directly to it without any other device (not even the HA server), there seem to be no problems! Unfortunately, I do not have another KNX-IP interface to test with. Mine is a SIEMENS N148/22, used for tunneling. I just have my old Lifedomus server, which uses one channel, HA, and sometimes ETS, so 3 or 4 channels at most.

Sometimes it lasts 10 hours without any disconnections, sometimes the KNX entities switch every 2 seconds. Sometimes when I reconnect to the HA interface (computer or smartphone), HA is rebooting... It also did that with my previous installation (another NUC and another version of HA with Portainer; today it is HA OS on another NUC).

PS: it's possible that the problem started at the beginning of 2022 with an update, but it has gotten worse and worse over time.

When you say you lose 3-10 % of all ping packets, this does scream "network problem" to me 🤷 It doesn't change anything that it worked a year ago - devices can become defective over time.
Another hint for me is that this is the first reported issue of that kind since beginning of 2022 (the ones you linked are older and solved now).
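For reference, the 3-10 % figure follows directly from the reported "one packet lost for every 10 to 30 sent". A minimal sketch of the arithmetic (the helper name `loss_percent` is made up for illustration):

```python
def loss_percent(lost: int, sent: int) -> float:
    """Return packet loss as a percentage of packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * lost / sent


# One lost packet in every 10 packets and in every 30 packets sent.
worst = loss_percent(1, 10)
best = loss_percent(1, 30)
print(f"loss between {best:.1f}% and {worst:.1f}%")
```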

I'm afraid there is not really anything I can do for you without any logs or tcpdumps. And even then, my knowledge is limited to xknx.

In HA when KNX entities change their state to "unavailable" it means xknx lost the connection.

Oh, and maybe check or change (for a test) the power supply of the interface or even the bus itself.

Yes, of course it screams "network problem", but when the HA server is off, there is no problem reaching the KNX-IP interface... I'm not 100% sure of that, so I'll check once again whether, when the issue appears, pinging the KNX-IP interface gives different results with the HA server connected to and disconnected from the network, and I'll keep you informed; I don't want to talk nonsense.

Right now I have been pinging my KNX-IP interface for 15 minutes and every ping got a response within 1-3 ms in the normal situation (all devices connected, all integrations activated in HA, ...). But everything in HA is working without any disconnections at the moment, so it is not a significant period. I really don't understand where the problem could be...

The power supply is a 960 mA one, and the KNX bus without IP supervision has had no problems at all.

See https://www.home-assistant.io/integrations/knx/#logs-for-the-knx-integration
But at this point I guess capturing the whole traffic to/from the interface with Wireshark / tcpdump would probably reveal more information.
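In Home Assistant itself, the debug level is set through the logger configuration described on the linked page. As a rough illustration of what that does underneath, here is a sketch using Python's standard `logging` module, assuming xknx logs through it under names such as `xknx.log` and `xknx.cemi` (the names Home Assistant shows as log sources). The logger `some.other.library` is only a placeholder.

```python
import logging

# Send log records to the console; the root level stays at WARNING.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Raising the parent "xknx" logger to DEBUG also applies to its children,
# because child loggers inherit the level of the nearest configured ancestor.
logging.getLogger("xknx").setLevel(logging.DEBUG)

for name in ("xknx.log", "xknx.cemi"):
    child = logging.getLogger(name)
    print(name, "debug enabled:", child.isEnabledFor(logging.DEBUG))

# A placeholder logger outside the "xknx" hierarchy keeps the root level.
other = logging.getLogger("some.other.library")
print("some.other.library debug enabled:", other.isEnabledFor(logging.DEBUG))
```

Setting only the `xknx` hierarchy to debug keeps the rest of the log readable while still recording every frame-level message from the KNX connection.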

OK, thanks. I will probably have to configure port mirroring on my switch and dive into Wireshark (I haven't used it for years :) ) and the specific logs for KNX. I haven't taken the time for that yet.

At this moment, it really feels like a joke, because this morning around 10 am I upgraded my system to the latest version, 2023.2.3 (did you modify something in KNX in this release? :) ), and since the reboot I can ping the KNX interface without any loss or any response time above 3 ms. HA is fully operational. I hope it will last! I'll keep you informed.

There haven't been changes in minor releases of 2023, and nothing that could have caused or fixed such issues for about a year.
Besides that, there are probably about 2000 other users of the knx Integration we would have heard from by now 🤷

Port mirroring sounds like a good idea.

The problem reappeared at 6 pm ... :(

Here are some logs

Logger: xknx.log
Source: /usr/local/lib/python3.10/site-packages/xknx/io/tunnel.py:526
First occurred: 18:20:50 (4 occurrences)
Last logged: 18:21:57

Received TunnellingRequest with sequence number not equal to expected: 14. Discarding frame:
Received TunnellingRequest with sequence number not equal to expected: 62. Discarding frame:
Received TunnellingRequest with sequence number not equal to expected: 6. Discarding frame:
Received TunnellingRequest with sequence number not equal to expected: 49. Discarding frame:

Logger: xknx.log
Source: runner.py:128
First occurred: 18:20:57 (220 occurrences)
Last logged: 18:23:23

Resending the telegram repeatedly failed. Did not receive a TUNNELLING_ACK within 1 second for frame with sequence_counter=10
Error: KNX bus did not respond in time (2.0 secs) to GroupValueRead request for: 5/7/0
Could not sync group address '5/7/0' (R1 - State)
Error: KNX bus did not respond in time (2.0 secs) to GroupValueRead request for: 5/7/2
Could not sync group address '5/7/2' (R3 - State)

Logger: xknx.cemi
Source: runner.py:128
First occurred: 18:23:04 (1 occurrences)
Last logged: 18:23:04

Could not send CEMI frame: Resending the telegram repeatedly failed. Did not receive a TUNNELLING_ACK within 1 second for frame with sequence_counter=10 for <CEMIFrame code="L_DATA_REQ" src_addr="IndividualAddress("15.15.198")" dst_addr="GroupAddress("5/4/24")" flags="1011110011100000" tpci="TDataGroup()" payload="" />
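The "sequence number not equal to expected" and "Did not receive a TUNNELLING_ACK" messages both relate to the per-connection sequence counter used in KNXnet/IP tunnelling. As far as I understand the protocol, the counter is a single byte, the receiver expects the next value, a repeat of the previous value is acknowledged but dropped, and anything else is dropped without acknowledgement. The sketch below illustrates that rule only; it is not xknx's implementation, and the names `SequenceTracker` and `Action` are made up.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "ack and process"
    ACK_ONLY = "ack and discard (repeated frame)"
    DISCARD = "discard without ack"


class SequenceTracker:
    """Hypothetical receiver-side tracker for a one-byte sequence counter."""

    def __init__(self, expected: int = 0) -> None:
        self.expected = expected % 256

    def classify(self, received: int) -> Action:
        if received == self.expected:
            # Expected frame: advance the counter, wrapping at 256.
            self.expected = (self.expected + 1) % 256
            return Action.ACCEPT
        if received == (self.expected - 1) % 256:
            # Sender repeated the last frame, e.g. because our ACK was lost.
            return Action.ACK_ONLY
        # Anything else is out of sequence.
        return Action.DISCARD


tracker = SequenceTracker(expected=10)
for seq in (10, 10, 14, 11):
    print(seq, tracker.classify(seq).value)
```

On a link that drops packets, both sides can drift apart this way: lost frames produce out-of-sequence warnings on one side and missing acknowledgements on the other.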

Please upload the whole log file.

Hello,

Yesterday was a good day for my HA... (no ping loss with the KNX-IP interface, no flapping of KNX entities, ...) but this morning when I opened the app on my phone, it didn't connect to the server. After a few minutes the interface appeared, but HA was booting and the KNX integration was down... I immediately went to look at the logs.

Logger: homeassistant.components.recorder.util
Source: components/recorder/util.py:241
Integration: Recorder (documentation, issues)
First occurred: 08:15:55 (1 occurrences)
Last logged: 08:15:55

The system could not validate that the sqlite3 database at //config/home-assistant_v2.db was shutdown cleanly

Logger: homeassistant.components.recorder.util
Source: components/recorder/util.py:577
Integration: Recorder (documentation, issues)
First occurred: 08:15:55 (1 occurrences)
Last logged: 08:15:55

Ended unfinished session (id=521 from 2023-02-10 06:50:26.385141)

Here's the log file.
home-assistant_2023-02-10T07-17-45.846Z_.log

My theory is that something (related to statistics / recorder / database) makes HA crash. The KNX tunnel is not closed properly, and when HA reboots there is no free channel on the KNX-IP interface. After a few tens of minutes the channel is released and the HA KNX integration can connect again. Is this theory plausible?

Thank you for helping me :)

I'm trying to help, but you make it quite hard. You didn't set the log level to debug for xknx so these logs are not very useful.
These also seem to be the logs from after HA booted again - the ones from when it crashed would be much more interesting. These are usually saved in the configuration directory as home-assistant.log.1 or home-assistant.previous.log or something like that.
You should also deactivate all custom components when debugging.

My theory is that something (related to statistics / recorder / database) makes HA crash. The KNX tunnel is not closed properly, and when HA reboots there is no free channel on the KNX-IP interface. After a few tens of minutes the channel is released and the HA KNX integration can connect again. Is this theory plausible?

If HA did crash and there was no clean shutdown, then the tunnel cannot be released properly. And interfaces (especially older ones) tend to take a while to release unused tunnels. I don't know how many concurrent tunnels your interface supports, but it seems plausible.

Your HA crash and your otherwise loss of KNX connection don't necessarily have to have the same cause, but who knows...

Another thing I wanted to confirm... when I lose ping packets from the KNX-IP interface, disconnecting the HA server from the network seems to solve the problem. In other words, it is the HA server that seems to overload the KNX-IP interface.

I don't know yet how to set the log level to debug for xknx, but I will certainly learn how to do it and post the right logs.

My KNX-IP interface is a SIEMENS N148/22, which is quite old... It provides up to five KNXnet/IP tunneling connections, so it should be enough since I only have one HA server and one Lifedomus server that use the interface.

Sorry in advance, because I have to leave the debugging for a few days, but I'll resume it as soon as possible and keep you informed.

I don't know yet how to set the log level to debug for xknx, but I will certainly learn how to do it and post the right logs.

follow the link 😉

should be enough since I only have one HA server and one Lifedomus server that use the interface

When it is a network / interface problem, the second server will probably have the same issues (it may just not expose them to the user, but reconnect silently). So once, say, the network is stable again, there are already 2 orphaned tunnels on the interface.
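To make the orphaned-tunnel scenario concrete, here is a toy model of an interface with a fixed number of tunnel slots that only frees a slot after a period of silence. The five slots match the N148/22 figure quoted above; the 600-second timeout and the class name `TunnelInterface` are assumptions for illustration, not values taken from the device.

```python
from typing import Dict, Optional


class TunnelInterface:
    """Hypothetical model of an interface with a fixed number of tunnel slots."""

    def __init__(self, max_tunnels: int, idle_timeout: float) -> None:
        self.max_tunnels = max_tunnels
        self.idle_timeout = idle_timeout
        self.last_seen: Dict[int, float] = {}  # channel id -> time of last traffic
        self._next_id = 1

    def _expire(self, now: float) -> None:
        # Free tunnels that have been silent for longer than the timeout.
        stale = [cid for cid, seen in self.last_seen.items()
                 if now - seen > self.idle_timeout]
        for cid in stale:
            del self.last_seen[cid]

    def connect(self, now: float) -> Optional[int]:
        """Return a new channel id, or None when every slot is occupied."""
        self._expire(now)
        if len(self.last_seen) >= self.max_tunnels:
            return None
        cid = self._next_id
        self._next_id += 1
        self.last_seen[cid] = now
        return cid

    def disconnect(self, cid: int) -> None:
        """Clean shutdown: the slot is freed immediately."""
        self.last_seen.pop(cid, None)


iface = TunnelInterface(max_tunnels=5, idle_timeout=600.0)
# Five connections that are never closed, as after repeated crashes or drops.
for t in (0.0, 10.0, 20.0, 30.0, 40.0):
    iface.connect(t)
print(iface.connect(50.0))   # all slots are held by orphaned tunnels
print(iface.connect(601.0))  # the oldest orphan has timed out by now
```

With a clean `disconnect` the slot is available again at once, which is why crashes and silent reconnects are so much more costly than normal restarts.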

Hello,

Some news... when I got back from vacation, I ran some tests again. In particular, I redid an install from scratch (except for copy-pasting the KNX YAML) on another server. Result: with the 2 servers running different config versions, the old version kept crashing its server and therefore the connection to the KNX-IP interface, which also affected the other server with the new config. I was able to see these crashes by following the uptime of each server...

Today, it runs flawlessly with the new config, with the old server disabled. So in the end it was not a network problem, nor a KNX-IP interface problem, but an HA problem (apparently a history and DB problem).
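Following uptime to spot crashes can be automated along these lines: sample each server's uptime periodically and flag every sample where it is lower than the previous one. This is only a sketch; the function name `find_restarts` and the sample values are made up for illustration and are not measurements from this installation.

```python
from typing import List, Tuple


def find_restarts(samples: List[Tuple[float, float]]) -> List[float]:
    """Return timestamps at which uptime dropped compared to the previous sample.

    Each sample is a (timestamp, uptime_seconds) pair, in chronological order.
    """
    restarts = []
    for (_, prev_up), (ts, up) in zip(samples, samples[1:]):
        if up < prev_up:
            restarts.append(ts)
    return restarts


# Illustrative samples taken every 60 seconds; uptime resets before t=180.
samples = [(0, 5000), (60, 5060), (120, 5120), (180, 20), (240, 80)]
print(find_restarts(samples))
```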

Thank you for your help.
Eric