openhab/org.openhab.binding.zigbee

[Telegesis] Zigbee binding stops working after some time

Closed this issue · 15 comments

Outline

The zigbee binding stops working after some commands are send, afterwards only a java.util.concurrent.RejectedExecutionException: Task com.zsmartsystems.zigbee.dongle.telegesis.ZigBeeDongleTelegesis$2@107f71 rejected from java.util.concurrent.ThreadPoolExecutor occurs. appears.

This was reproducible at least 5 times, to work with the binding again openhab has to be restarted.
If no command is sent, binding runs fine for several days. After a few commands being send the mentioned error occurs.
The commands which are send are processed normally, the lights switch as requested.
Please see logs for more details.

I think this happened already with 2.4.0, but I didn't analyzed the issue there any further as I switched to another gateway.

Configuration

Configuration Description
Coordinator used Qivicon Zigbee USB Stick (Telegesis)
openHAB version 2.5.0-M2
Hardware Synology Diskstation ds213j
Memory 512MB
Java version Java(TM) SE Embedded Runtime Environment (build 1.8.0_121-b13)
Devices 6 Osram RGBW gu10 bulbs, 1 Osram e27 bulb, 1 Osram flex strip

Logs

openhab.log

This is not really a problem as far as I can see. This issue is caused by the binding (or something in the system) sending a command AFTER the system has been shut down. This exception is simply stating that the command is rejected as the system is shutdown.

The bigger question is why the system is shutdown - it should restart, but I don't see that in the log. Maybe it's not implemented for Telegesis.

With your feedback I've analysed the code and I think, the issue is in TelegesisFrameHandler: the variables retries is increased, but never set back to 0 if the transaction was successfully completed.
According to the comment at ACK_TIMEOUTS it should be consecutive fails, but currently it is 2 timeouts during whole run.

If my conclusions are correct, the stopTimer() method should be changed to the following:

    private void stopTimer() {
        if (timeoutTimer != null) {
            logger.trace("TELEGESIS Timer: Stop");
            timeoutTimer.cancel(false);
            timeoutTimer = null;
            retries = 0;
        }
    }

I will try to create a push request with a junit test.

I think you are correct that this is a bug - I recall this in the past and thought it had been fixed, but it seems to have crept back in.

However, I don't think that is the real problem. Your log shows that the system goes OFFLINE but doesn't restart.

2019-08-29 21:04:18.571 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - ZigBeeNetworkManager shutdown: networkState=OFFLINE
2019-08-29 21:04:18.574 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - Network state is updated to SHUTDOWN
2019-08-29 21:04:18.577 [DEBUG] [bee.handler.ZigBeeCoordinatorHandler] - null: networkStateUpdated called with state=SHUTDOWN
2019-08-29 21:04:18.579 [DEBUG] [pp.discovery.ZigBeeNetworkDiscoverer] - Network discovery task: shutdown

So, if you fix the issue you mention, then this will reduce the chance of the problem happening, but it will not solve the issue as the Telegesis coordinator will still not restart after an error.

Have created a pull request for the retry counter issue, I will check if I find the solutions for the not restarting coordinator later on. Do you have any hint, if this should be done in this binding or in the zsmartsystems component?
As I'm not very familiar with the code this would help me a lot.

It should be handled in the binding - in the ZigBeeCoordinatorHandler class there is a reconnection check.

@triller-telekom wrote this if I remember correctly so he may also have some thoughts if you have questions.

I think I found three issues, which prevent the restart, at least one of it should affect all dongles, but I don't know what the exact impact of it is.

If I see it correctly, in case the dongle goes offline, in ZigBeeCoordinatorHandler the method startReconnectJobIfNotRunning() is invoked, which calls networkManager.shutdown(). This has several effects, two of them are:

  • Only for Telegesis: The shutdown() method In ZigBeeDongleTelegesis stops an Executor which is only started when the class is instantiated. As the object ZigBeeDongleTelegesis ist not instantiated again but reused, the Executor rejects the next command.
  • In ZigBeeNetworkManager.shutdown() the com.zsmartsystems.zigbee.internal.NotificationService is also shut down, this is a static call and cannot be undone, meaning the NotificationService cannot be restarted currently. Either it should have a possibility to restart it or this call should not be made.

And another thing which I don't understand currently: In ZigBeeThingHandler the method bridgeStatusChanged() is not called, although an updateStatus(...) must have been called. This currentls leads to a reference to the old channels/networkmanager etc.

For the issue with the Telegesis dongle not being initialised again I may be able to create a pull request, for the ZigBeeNetworkManager.shutdown() I'm not able to see the impacts.

Telegesis handler must not be reused - it must be reinstantiated (ie a new class must be instantiated).

The second one is incorrect since in ZigBeeNetworkManager.initialize() we initialise the NotificationService.

        NotificationService.initialize();

What do you mean by bridgeStatusChanged is not called? This is called by the framework - or do I misunderstand?

My bad, the NotificationService is indeed already initialized, I was looking in the source of the 1.2.1 (included dependency in this binding).

By bridgeStatusChanged is not called I mean, that I don't see the log message "{}: Coordinator status changed to {}" in the logs. But I see the message
[bee.handler.ZigBeeCoordinatorHandler] - null: networkStateUpdated called with state=OFFLINE
which means, that in ZigBeeCoordinatorHandler the method updateStatus(ThingStatus.OFFLINE, ThingStatusDetail.COMMUNICATION_ERROR); has been called. I don't know the openHAB framework that well, but I would assume that, as the ZigBeeCoordinatorHandler is a Bridge thing the bridgeStatusChanged methods of the things should have been called.

But I may be running in the wrong direction, have to take a closer look at it...

First of all: Thank you for finding the issue with the wrong counter in the TelegesisFrameHandler!

Regarding the missing restart. I see the restart in the logs:

2019-08-29 21:04:19.034 [DEBUG] [bee.handler.ZigBeeCoordinatorHandler] - Initialising ZigBee coordinator
2019-08-29 21:04:19.037 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - Default: Set profile to ZigBeeTransactionProfile [maxOutstandingTransactions=2, interTransactionDelay=50, maxRetries=2]
2019-08-29 21:04:19.038 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - Broadcast: Set profile to ZigBeeTransactionProfile [maxOutstandingTransactions=3, interTransactionDelay=1200, maxRetries=0]
2019-08-29 21:04:19.040 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - Multicast: Set profile to ZigBeeTransactionProfile [maxOutstandingTransactions=3, interTransactionDelay=1200, maxRetries=0]
2019-08-29 21:04:19.042 [DEBUG] [bee.handler.ZigBeeCoordinatorHandler] - Mesh Update Period 86400
2019-08-29 21:04:19.043 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - Adding supported cluster 1280
2019-08-29 21:04:19.044 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNode ] - ClusterMatcher starting
2019-08-29 21:04:19.046 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNode ] - ClusterMatcher adding cluster 1280
2019-08-29 21:04:19.047 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - Adding supported cluster 25
2019-08-29 21:04:19.048 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNode ] - ClusterMatcher adding cluster 25
2019-08-29 21:04:19.050 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - ZigBeeNetworkManager initialize: networkState=UNINITIALISED
2019-08-29 21:04:19.051 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - Network state is updated to INITIALISING
...
2019-08-29 21:04:20.799 [DEBUG] [ngle.telegesis.ZigBeeDongleTelegesis] - Telegesis dongle initialize.
2019-08-29 21:04:20.800 [DEBUG] [ding.zigbee.handler.ZigBeeSerialPort] - Connecting to serial port [/dev/ttyUSB1] at 19200 baud, flow control FLOWCONTROL_OUT_NONE.
2019-08-29 21:04:20.916 [DEBUG] [ding.zigbee.handler.ZigBeeSerialPort] - Serial port [/dev/ttyUSB1] is initialized.
2019-08-29 21:04:20.926 [DEBUG] [gesis.internal.TelegesisFrameHandler] - TelegesisFrameHandler thread started

However there is still a problem:

2019-08-29 21:04:21.697 [DEBUG] [gesis.internal.TelegesisFrameHandler] - Exception processing Telegesis frame: TelegesisNetworkJoinedEvent [channel=11, panId=26769, epanId=036A360419D6A7CC]: 
java.util.concurrent.RejectedExecutionException: Task com.zsmartsystems.zigbee.dongle.telegesis.ZigBeeDongleTelegesis$2@1648d5b rejected from java.util.concurrent.ThreadPoolExecutor@14ccfa8[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 8535]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:?]
	at com.zsmartsystems.zigbee.dongle.telegesis.ZigBeeDongleTelegesis.telegesisEventReceived(ZigBeeDongleTelegesis.java:739) ~[212:com.zsmartsystems.zigbee.dongle.telegesis:1.2.1]
	at com.zsmartsystems.zigbee.dongle.telegesis.internal.TelegesisFrameHandler.notifyEventReceived(TelegesisFrameHandler.java:459) [212:com.zsmartsystems.zigbee.dongle.telegesis:1.2.1]
	at com.zsmartsystems.zigbee.dongle.telegesis.internal.TelegesisFrameHandler.access$600(TelegesisFrameHandler.java:41) [212:com.zsmartsystems.zigbee.dongle.telegesis:1.2.1]
	at com.zsmartsystems.zigbee.dongle.telegesis.internal.TelegesisFrameHandler$1.run(TelegesisFrameHandler.java:187) [212:com.zsmartsystems.zigbee.dongle.telegesis:1.2.1]

Somehow the executor isn't reset. In ZigbeeDongleTelegesis this is a class variable:

private ExecutorService commandScheduler = ZigBeeExecutors.newFixedThreadPool(1, "TelegesisCommands");

which is NOT reset if we create a new object from that class.

So I guess the fix would be to move the initialization of this commandScheduler executor to the constructor. WDYT?

which is NOT reset if we create a new object from that class.

Hmm - it's unclear to me why the commandScheduler is not reset - it's creating a new object- it's not static - it MUST be "reset" if it's a new object. (right?).

Are you really sure that there is a new handler being created and it's not trying to reuse the old handler? This seems more likely to me? Also, if this is the case, moving the instantiation to the constructor will not fix the problem.

I think instead of working around this issue, and potentially finding another problem further down the line, it is better to understand why this is happening. We've stated that it is not possible to reinitialise a handler after it is shutdown() and I think this must be what is happening?

I'm pretty sure, the ZigBeeDongleTelegesis is not reinstantiated, but reused. The ZigBeeDongleTelegesis is instantiated in TelegesisHandler.initialize(); which is only invoked once.. So perhaps ZigBeeCoordinatorHandler could be extended with an abstract method createTransportTransmit() which creates the instance? This way a new instance could be created.

You are right, I think I have an idea how to fix it. I will look into it today.

I have created a PR where I cleaned up the code and introduced the abstract method mentioned by @yoadey

This should fix the issue with the restarts. Together with the other fix in ZSS this should finally resolve this issue.

Closing this via #564